How to make casefold() work on certain Arabic unicodes - python

I've got some issues with detecting the "equality" in Python 2.7 of some Arabic pairs of words:
أكثر vs اكثر
قائمة vs قائمه
إنشاء vs انشاء
The elements of each pair are not really identical, but they are written with different cases. A useful analogy for me (I don't know any Arabic) is Word vs word. They are not identical, but if I lowercase both of them, I'll obtain word vs word, which are identical. That's what I want to obtain from these 3 pairs of Arabic words.
I'm going to illustrate what I have tried so far using the first pair (1. أكثر vs اكثر). By the way, the meaning of both Arabic words from the first pair is "more", but they have different cases (as a parallel: More vs more). I don't know Arabic or its rules at all, so it would be great if someone who knows Arabic could confirm that those words are "identical".
str1 = u'أكثر'
str2 = u'اكثر'
So what I'm trying to do is to bring str1 and str2 to the same form (if possible), so I want a function that produces the same output for both strings:
transform(str1) == transform(str2)
In English this can be achieved easily:
a = u'More'
b = u'more'
def transform(text):
    return text.lower()
>>> transform(a) == transform(b)
True
But, of course, this doesn't work for Arabic, as there is no such thing as lower case or upper case.
>>> str1
u'\u0623\u0643\u062b\u0631'
>>> str2
u'\u0627\u0643\u062b\u0631'
Note that only the first character differs in the unicode representation.
I also normalized the strings using:
import unicodedata
>>> n_str1 = unicodedata.normalize('NFKD', str1)
>>> n_str2 = unicodedata.normalize('NFKD', str2)
>>> n_str1
u'\u0627\u0654\u0643\u062b\u0631'
>>> n_str2
u'\u0627\u0643\u062b\u0631'
As you already noticed:
>>> n_str1 == n_str2
False
After that, I tried to use unicode.casefold(), but it isn't available in Python 2. I installed the py2casefold library, but I didn't manage to obtain equality between the strings with it. So I tried Python 3's str.casefold(), but without any luck:
>>> str1.casefold() == str2.casefold()
False
>>> n_str1.casefold() == n_str2.casefold()
False
A solution for this in Python 2 would be perfect, but it would be great in Python 3 too.
Thank you.

These words are not identical: u'أكثر' and u'اكثر' are not the same. The first letter of the first word is Alif with Hamza above (U+0623); perhaps you didn't notice that because of the small size of the glyph.
The first letter of the second word, however, is a plain Alif (U+0627), reading from right to left.
Hence they don't compare equal: each of these letters has its own Unicode code point. They are not equal from the perspective of the language either:
>>> u'أكثر'; u'اكثر'
u'\u0623\u0643\u062b\u0631'
u'\u0627\u0643\u062b\u0631'
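As a quick check, the differing code points can be named with unicodedata. This is only an illustrative sketch for Python 3; the variable names are mine, and the strings are written with escapes so the snippet does not depend on how the Arabic text was pasted:

```python
import unicodedata

str1 = "\u0623\u0643\u062b\u0631"  # first word of pair 1
str2 = "\u0627\u0643\u062b\u0631"  # second word of pair 1

# Collect every position where the two words differ, with the Unicode names.
diffs = [
    (i, unicodedata.name(a), unicodedata.name(b))
    for i, (a, b) in enumerate(zip(str1, str2))
    if a != b
]
print(diffs)
```

Only the first character differs, and the two names show that they are distinct letters rather than two cases of one letter.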
They are not identical, but if I lowercase both of them, I'll obtain word vs word, which are identical. That's what I want to obtain from these 3 pairs of Arabic words.
There's no lower or upper case in Arabic. The words you have are not the same: they contain different letters. Some of them are spelled correctly while others are misspelled. Many Arabic readers would treat the two spellings in each pair as the same word, but strictly speaking they are not. They still convey the meaning, though; in English, your list of Arabic words would look roughly like this:
1- more, moore
2- menu, manu
3- establish, estblish
I'm going to exemplify what I tried by now using the first pair (1. أكثر vs اكثر). By the way, the meaning of both Arabic words from the first pair is "menu", but they have different cases (as a parallel: Menu vs menu)
No, أكثر means more. Your second pair means menu, but there's no such thing as Menu or menu in Arabic. I won't delve into the details, because that would be off topic.
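If, for matching purposes, you still want to treat such spelling variants as equal, one common approach is to fold the variant letters onto a base letter before comparing. The following is a minimal sketch, not something built into Python: the mapping table is my own assumption and covers only the letters involved in the three pairs from the question, the name transform is borrowed from the question, and folding like this can also merge words that are genuinely different.

```python
import unicodedata

# Assumed folding table (code point -> code point); extend it as needed.
FOLD = {
    0x0622: 0x0627,  # ALEF WITH MADDA ABOVE -> ALEF
    0x0623: 0x0627,  # ALEF WITH HAMZA ABOVE -> ALEF
    0x0625: 0x0627,  # ALEF WITH HAMZA BELOW -> ALEF
    0x0629: 0x0647,  # TEH MARBUTA -> HEH
}

def transform(text):
    # Compose first, so that ALEF followed by a combining hamza is seen
    # as the single precomposed letter, then fold the variants.
    return unicodedata.normalize("NFC", text).translate(FOLD)

pairs = [
    ("\u0623\u0643\u062b\u0631", "\u0627\u0643\u062b\u0631"),
    ("\u0642\u0627\u0626\u0645\u0629", "\u0642\u0627\u0626\u0645\u0647"),
    ("\u0625\u0646\u0634\u0627\u0621", "\u0627\u0646\u0634\u0627\u0621"),
]
for a, b in pairs:
    print(transform(a) == transform(b))
```

The same idea should carry over to Python 2 with u'' literals, since unicode.translate also accepts a dict keyed by code point, but I have only reasoned about the Python 3 version here.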

Related

How to remove spaces in a "single" word? ("bo ok" to "book")

I am reading a badly formatted text, and often there are unwanted spaces inside a single word. For example, "int ernational trade is not good for economies" and so forth. Is there any efficient tool that can cope with this? (There are a couple of other answers like here, which do not work in a sentence.)
Edit: I agree about the impossibility mentioned. One option is to preserve all possible options. In my case, this edited text will be matched against another database that holds the original (clean) text, so any wrong removal of spaces simply gets tossed away.
You could use the PyEnchant package as an English dictionary. I will assume that a fragment that is not a word on its own, but forms a word when joined to the fragment before it, belongs to that word, and use the following code to find words that are split by a single space:
import enchant
text = "int ernational trade is not good for economies"
fixed_text = []
d = enchant.Dict("en_US")
for i in range(len(words := text.split())):
    if fixed_text and not d.check(words[i]) and d.check(compound_word := ''.join([fixed_text[-1], words[i]])):
        fixed_text[-1] = compound_word
    else:
        fixed_text.append(words[i])
print(' '.join(fixed_text))
This splits the text on spaces and appends the words to fixed_text one by one. When the current word is not in the dictionary, but joining it to the previously added word produces a valid word, it sticks those two together.
This should help sanitize most of the invalid words, but as the comments mentioned it is sometimes impossible to find out if two words belong together without performing some sort of lexical analysis.
As suggested by Pranav Hosangadi, here is a modified (and a little more involved) version which can remove multiple spaces in words by compounding previously added words which are not in the dictionary. However, since a lot of smaller words are valid in the English language, many spaced out words don't correctly concatenate.
import enchant
text = "inte rnatio nal trade is not good for ec onom ies"
fixed_text = []
d = enchant.Dict("en_US")
for i in range(len(words := text.split())):
    if fixed_text and not d.check(compound_word := words[i]):
        for j, pending_word in enumerate(fixed_text[::-1], 1):
            if not d.check(pending_word) and d.check(compound_word := ''.join([pending_word, compound_word])):
                del fixed_text[-j:]
                fixed_text.append(compound_word)
                break
        else:
            fixed_text.append(words[i])
    else:
        fixed_text.append(words[i])
print(' '.join(fixed_text))
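The question's edit mentions keeping every possible option and letting the later database match discard the wrong ones. A rough sketch of that idea, independent of PyEnchant, is below; candidate_joins is a hypothetical helper name of mine, and because the number of candidates doubles with every space it is only practical for short fragments:

```python
from itertools import product

def candidate_joins(text):
    """Yield every string obtained by keeping or deleting each space."""
    words = text.split()
    if not words:
        return
    # For each gap between words, either keep the space or drop it.
    for choices in product((" ", ""), repeat=len(words) - 1):
        out = words[0]
        for sep, word in zip(choices, words[1:]):
            out += sep + word
        yield out

print(list(candidate_joins("bo ok")))
print(len(list(candidate_joins("int ernational trade is"))))
```

Each candidate could then be looked up in the clean database, so a wrong join simply fails to match.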

Replace Arabic text with Python

I have a str that has Arabic characters in it
text = "صَوتُ صَفيرِ البُلْبُلِ"
I am trying to remove specific characters like ص
I tried
text.replace("ص", "")
but nothing worked.
I searched and found some blogs saying that we need to write Arabic with English, but that is not practical.
I'm not familiar with Arabic text, but I do know that Arabic letters work differently than letters in Latin/English. Nearby letters somehow affect each other, which might be the source of confusion here.
Here's what happens when you carry out the replacement (in Python 3):
text = "صَوتُ صَفيرِ البُلْبُلِ"
text2 = text.replace("ص", "")
print(text) # صَوتُ صَفيرِ البُلْبُلِ
print(text2) # وتُ َفيرِ البُلْبُلِ
The comments above are copied from the printed output. However, if you copy them back in, the copied output of text2 is in fact not identical to text2: something is missing from the printout (this is not the case for the original text). I imagine that the resulting text2 cannot in fact be printed correctly (i.e. in Arabic, some combinations of characters/symbols do not result in meaningful text).
Let's not rely on direct printout then, but instead consider each character at a time:
import itertools, unicodedata
text = "صَوتُ صَفيرِ البُلْبُلِ"
text2 = text.replace("ص", "")
def compare_texts(text1, text2):
    for c1, c2 in itertools.zip_longest(text1, text2, fillvalue=""):
        name1 = unicodedata.name(c1) if c1 else ''
        name2 = unicodedata.name(c2) if c2 else ''
        print(f"{name1:<20} : {name2:<20}")
compare_texts(text, text2)
From the output of the above, we see that text2 indeed is just text with ARABIC LETTER SAD ('ص') missing in two places.
In conclusion: str.replace() does what you want (or at least what you tell it to do), it just might not look like it in the (naïvely) printed output.
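One likely reason the printout looks odd is that the vowel mark that followed each removed letter is left behind without a base letter. If the intent is to drop the letter together with its marks, something along these lines might help. This is a sketch under that assumption; remove_letter_with_marks is a hypothetical helper of mine, and it relies on unicodedata.combining() returning a non-zero class for the Arabic diacritics:

```python
import unicodedata

def remove_letter_with_marks(text, letter):
    """Remove `letter` and any combining marks that directly follow it."""
    out = []
    skipping = False
    for ch in text:
        if ch == letter:
            skipping = True          # start dropping the letter's marks
            continue
        if skipping and unicodedata.combining(ch):
            continue                 # a mark attached to the removed letter
        skipping = False
        out.append(ch)
    return "".join(out)

sample = "\u0635\u064e\u0648\u062a"   # SAD, FATHA, WAW, TEH
print(ascii(sample.replace("\u0635", "")))
print(ascii(remove_letter_with_marks(sample, "\u0635")))
```

The plain replace keeps the orphaned FATHA (U+064E) at the start of the string, while the helper removes it along with the letter.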
Bonus
Here's a short video describing how/why Arabic (and other non-Latin writing systems) are more complicated than the Latin one.

string matching with interchangeable characters

I am trying to do a simple string matching between two strings, a small string to a bigger string. The only catch is that I want to equate two characters in the small string to be the same. In particular if there is a character 'I' or a character 'L' in the smaller string, then I want it to be considered interchangeably.
For example let's say my small string is
s = 'AKIIMP'
and then the bigger string is:
b = 'MPKGEXAKILMP'
I want to write a function that takes the two strings and checks whether the smaller one is in the bigger one. In this particular example, the smaller string s is not a substring of b, because there is no exact match. In my case, however, it should match, because, as I mentioned, the characters 'I' and 'L' are used interchangeably.
Any idea of how I could proceed with this?
s.replace('I', 'L') in b.replace('I', 'L')
will evaluate to True in your example.
You could do it with regular expressions:
import re
s = 'AKIIMP'
b = 'MPKGEXAKILMP'
p = re.sub('[IL]', '[IL]', s)
if re.search(p, b):
    print(f'{s!r} is in {b!r}')
else:
    print('Not found')
This is not as elegant as @Deepstop's answer, but it provides a bit more flexibility in terms of what characters you equate.
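One caveat with building a pattern directly from s is that any regex metacharacters in it would be interpreted as such. A slightly more defensive variant is sketched below; the function name and the group parameter are my own inventions, not part of either answer:

```python
import re

def contains_with_equivalents(small, big, group="IL"):
    """Return True if `small` occurs in `big`, treating all characters
    in `group` as interchangeable."""
    char_class = "[" + re.escape(group) + "]"
    # Escape every other character so it is matched literally.
    pattern = "".join(
        char_class if ch in group else re.escape(ch) for ch in small
    )
    return re.search(pattern, big) is not None

print(contains_with_equivalents("AKIIMP", "MPKGEXAKILMP"))
print(contains_with_equivalents("AKMMP", "MPKGEXAKILMP"))
```

Passing a different group, for example "O0", would equate another set of characters without changing the function.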

How to find a grouping of upper and lower case characters in a string in python 3

I'm working on Python Challenges, and the level I'm on asks us to find a lower case letter surrounded on both sides by exactly three upper case letters. I've written the following code, which seems pretty crude but I feel should work. However, all I get is an empty string.
source="Hello there" #the string I have to work with
key=""#where I want to put the characters that fit
for i in source:
    if i==i.lower(): # if it's lowercase
        x=source.index(i) #makes number that's the index of i
        if source[x-1].upper()==source[x-1] and source[x-2]==source[x-2].upper() and source[x-3].upper()==source[x-3]: #checks that the three numbers before it are upper case
            if source[x+1].upper()==source[x+1] and source[x+2].upper()==source[x+2] and source[x+3].upper()==source[x+3]: #checks three numbers after are uppercase
                if source[x+4].lower()==source[x+4] and source[x-4].lower()==source[x-4]: #checks that the fourth numbers are lowercase
                    key+=i #adds the character to key
print(key)
I know this is really, really messy but I don't understand why it just returns an empty string. If you have any idea what's wrong, or a more efficient way to do it, I would really appreciate it. Thanks
This is far, far easier with a regular expression.
re.findall(r'(?<![A-Z])[A-Z]{3}([a-z])(?=[A-Z]{3}(?:\Z|[^A-Z]))', text)
Here's how it works:
(?<![A-Z]) is a negative lookbehind assertion which makes sure that we are not preceded by an upper-case letter.
[A-Z]{3} is three upper-case letters.
([a-z]) is the lower-case letter we are looking for.
(?=[A-Z]{3}(?:\Z|[^A-Z])) is a lookahead assertion that makes sure we are followed by three upper-case letters but not four.
You will probably need to change the grouping depending on what you actually want to find. This finds the lower-case letter.
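To see the "exactly three" behaviour, the pattern can be tried on a few short strings. The wrapper find_guarded and the sample inputs below are mine, chosen only for illustration:

```python
import re

PATTERN = r'(?<![A-Z])[A-Z]{3}([a-z])(?=[A-Z]{3}(?:\Z|[^A-Z]))'

def find_guarded(text):
    """Return lower-case letters with exactly three capitals on each side."""
    return re.findall(PATTERN, text)

print(find_guarded("aXYZdEFGh"))    # three capitals on both sides
print(find_guarded("aWXYZdEFGh"))   # four capitals before the letter
print(find_guarded("aXYZdEFGH"))    # four capitals after the letter
```

Only the first string yields a match; the lookbehind rejects the second and the lookahead rejects the third.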
I would suggest using the itertools.groupby method with a keyfunc to distinguish lowercase from capital letters.
First you need a helper function to refactor the check logic:
def check(subseq):
    return (subseq[0][0] and len(subseq[0][1]) == 3
            and len(subseq[1][1]) == 1
            and len(subseq[2][1]) == 3)
Then group up and check:
from itertools import groupby

def findNeedle(mystr):
    seq = [(k,list(g)) for k,g in groupby(mystr, str.isupper)]
    for i in range(len(seq) - 2):
        if check(seq[i:i+3]):
            return seq[i+1][1][0]
Inspect seq in the interpreter to see how this works, it should be very clear.
Edit: Some typos, I didn't test the code.
Now a test:
>>> findNeedle("Hello there HELxOTHere")
'x'
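For anyone who wants to see what the grouped sequence looks like without opening an interpreter, here is a small sketch. The sample string is mine, and the groups are joined into strings purely for readability, whereas the answer keeps them as lists:

```python
from itertools import groupby

sample = "abHELxOTHere"
# Consecutive characters sharing the same str.isupper() result form one group.
seq = [(is_upper, "".join(group)) for is_upper, group in groupby(sample, str.isupper)]
for item in seq:
    print(item)
```

The needle is the single-character group that sits between two upper-case groups of length three.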

Python, breaking up Strings

I need to make a program in which the user inputs a word and I need to do something to each individual letter in that word. They cannot enter it one letter at a time just one word.
For example, if someone enters "test", how can I make my program know that it is a four-letter word, and how do I break it up, say by making four variables, each set to a different letter? It should also work with longer and shorter words.
Could I use a for statement? Something like: for each letter, set that letter to a variable. But what if it was a 20-character word? How would the program get all the variable names and such?
Do you mean something like this?
>>> s = 'four'
>>> l = list(s)
>>> l
['f', 'o', 'u', 'r']
>>>
Addendum:
Even though that's (apparently) what you think you wanted, it's probably not necessary, because a string can hold a word of virtually any size. A single string variable like s above should be good enough for your program, versus trying to create a bunch of separately named variables for each character. For one thing, it would be difficult to write the rest of the program because you wouldn't know what valid variable names to use.
The reason it's OK not to have a separate variable for each character is that a single string can have any number of characters in it, and can also be empty. Python's built-in len() function returns the number of characters in a string, so the result of len(s) in the above would be 4.
Any character in a string can be randomly accessed by indexing it with an integer between 0 and len(s)-1 inside square brackets, so to reference the third character you would use s[2]. It's useful to think of the index as the offset of the character from the beginning of the string.
Even so, in Python indexing is often not needed, because you can also process each character of a string in a for loop without using indices, as shown in this simple example:
num_vowels = 0
for ch in s:
    if ch in 'aeiou':
        num_vowels += 1
print('there are', num_vowels, 'vowel(s) in the string', s)
Python also has many other facilities and built-ins that further help when processing strings (and in fact could simplify the above example), which you'll eventually learn as you become more familiar with the language and its many libraries.
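As one possible simplification of the vowel count, a generator expression with sum() does the same job in a single line. This is just a sketch of the kind of built-in the paragraph above alludes to, not necessarily what the author had in mind:

```python
s = 'four'
# Count one for every character of s that is a lower-case vowel.
num_vowels = sum(1 for ch in s if ch in 'aeiou')
print('there are', num_vowels, 'vowel(s) in the string', s)
```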
When you iterate a string, it returns the individual characters like
for c in thestring:
    print(c)
You can use this to put the letters into a list if you really need to, and the list will retain their order, but list(string) is a better choice for that (be aware that a set does not guarantee any order, and a dict only preserves insertion order from Python 3.7 on).
You don't have to do any of those: in Python, you can access characters of a string using square brackets:
>>> word = "word"
>>> print(word[0])
w
>>> print(word[3])
d
>>> print(len(word))
4
You don't want to assign each letter to a separate variable. Then you'd be writing the rest of your program without even being able to know how many variables you have defined! That's an even worse problem than dealing with the whole string at once.
What you instead want to do is have just one variable holding the string, but you can refer to individual characters in it with indexing. Say the string is in s, then s[0] is the first character, s[1] is the second character, etc. And you can find out how far up the numbers go by checking len(s) - 1 (because the indexes start at 0, a length 1 string has maximum index 0, a length 2 string has maximum index 1, etc).
That's much more manageable than figuring out how to generate len(s) variable names, assign them all to a piece of the string, and then know which variables you need to reference.
Strings are immutable though, so you can't assign to s[1] to change the 2nd character. If you need to do that you can instead create a list with e.g. l = list(s). Then l[1] is the second character, and you can assign l[1] = something to change the element in the list. Then when you're done you can get a new string out with s_new = ''.join(l) (join builds a string by joining together a sequence of strings passed as its argument, using the string it was invoked on to the left as a separator between each of the elements in the sequence; in this case we're joining a list of single-character strings using the empty string as a separator, so we just get all the single-character strings joined into a single string).
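The list-and-join round trip described above can be sketched as follows; the replacement character is arbitrary and the variable names are only illustrative:

```python
s = "test"
chars = list(s)           # ['t', 'e', 's', 't']
chars[1] = "a"            # lists are mutable, so this assignment is allowed
s_new = "".join(chars)    # glue the characters back together
print(s_new)
```

The original string s is left unchanged; only the new string reflects the edit.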
x = 'test'
counter = 0
while counter < len(x):
    print(x[counter]) # you can change this to do whatever you want to with x[counter]
    counter += 1
