Replace Arabic text with Python


I have a str that has Arabic characters in it
text = "صَوتُ صَفيرِ البُلْبُلِ"
I am trying to remove specific characters like ص
I tried
text.replace("ص", "")
but nothing worked.
I searched and found some blogs saying that we need to write Arabic with English letters, but that is not practical.

I'm not familiar with Arabic text, but I do know that Arabic letters work differently than letters in Latin/English. Nearby letters somehow affect each other, which might be the source of confusion here.
Here's what happens when you carry out the replacement (in Python 3):
text = "صَوتُ صَفيرِ البُلْبُلِ"
text2 = text.replace("ص", "")
print(text) # صَوتُ صَفيرِ البُلْبُلِ
print(text2) # وتُ َفيرِ البُلْبُلِ
The comments above are copied from the printed output. However, if you copy them back in, the copied output of text2 is in fact not identical to text2: something is missing from the printout (this is not the case for the original text). I imagine that text2 cannot in fact be printed correctly (i.e. in Arabic, some combinations of characters/symbols do not result in meaningful text).
Let's not rely on direct printout then, but instead consider each character at a time:
import itertools, unicodedata
text = "صَوتُ صَفيرِ البُلْبُلِ"
text2 = text.replace("ص", "")
def compare_texts(text1, text2):
    # Walk both strings in parallel and print the Unicode name of each character
    for c1, c2 in itertools.zip_longest(text1, text2, fillvalue=""):
        name1 = unicodedata.name(c1) if c1 else ''
        name2 = unicodedata.name(c2) if c2 else ''
        print(f"{name1:<20} : {name2:<20}")
compare_texts(text, text2)
From the output of the above, we see that text2 indeed is just text with ARABIC LETTER SAD ('ص') missing in two places.
In conclusion: str.replace() does what you want (or at least what you tell it to do), it just might not look like it in the (naïvely) printed output.
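One likely reason the printout looks odd is that the sample text carries short-vowel marks (harakat), which are combining characters attached to the preceding letter. Removing only the base letter leaves its mark stranded. The sketch below is not part of the original answer: the helper name remove_letter and the U+064B to U+0652 range are my own assumptions about which marks to strip.

```python
import re
import unicodedata

text = "صَوتُ صَفيرِ البُلْبُلِ"

# Assumption: treat U+064B..U+0652 (tanwin, fatha, damma, kasra, shadda, sukun)
# as the marks that belong to the letter being removed.
HARAKAT = "\u064B-\u0652"

def remove_letter(s, letter):
    """Remove `letter` together with any harakat that directly follow it."""
    return re.sub(f"{re.escape(letter)}[{HARAKAT}]*", "", s)

cleaned = remove_letter(text, "ص")
print(cleaned)

# Sanity check: no word should start with an orphaned combining mark.
orphans = [w for w in cleaned.split() if unicodedata.combining(w[0])]
print(orphans)
```

If the diacritics are not needed at all, stripping every combining mark first and then calling str.replace() would be a simpler route.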
Bonus
Here's a short video describing how/why Arabic (and other non-Latin writing systems) are more complicated than the Latin one.

Related

How to remove spaces in a "single" word? ("bo ok" to "book")

I am reading a badly formatted text, and often there are unwanted spaces inside a single word. For example, "int ernational trade is not good for economies" and so forth. Is there any efficient tool that can cope with this? (There are a couple of other answers like here, which do not work in a sentence.)
Edit: About the impossibility mentioned, I agree. One option is to preserve all possible options. In my case this edited text will be matched with another database that has the original (clean) text. This way, any wrong removal of spaces just gets tossed away.

You could use the PyEnchant package to get a list of English words. I will assume that two fragments which have no meaning on their own, but do when joined, form a single word, and use the following code to find words that are split by a single space:
import enchant
text = "int ernational trade is not good for economies"
fixed_text = []
d = enchant.Dict("en_US")
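for i in range(len(words := text.split())):
    # Merge with the previous word only if the current fragment is not a word
    # on its own but the joined string is
    if fixed_text and not d.check(words[i]) and d.check(compound_word := ''.join([fixed_text[-1], words[i]])):
        fixed_text[-1] = compound_word
    else:
        fixed_text.append(words[i])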
print(' '.join(fixed_text))
This will split the text on spaces and append words to fixed_text. When it finds that the current word is not in the dictionary, but joining it to the previously added word does make a valid word, it sticks those two words together.
This should help sanitize most of the invalid words, but as the comments mentioned it is sometimes impossible to find out if two words belong together without performing some sort of lexical analysis.
As suggested by Pranav Hosangadi, here is a modified (and a little more involved) version which can remove multiple spaces in words by compounding previously added words which are not in the dictionary. However, since a lot of smaller words are valid in the English language, many spaced out words don't correctly concatenate.
import enchant
text = "inte rnatio nal trade is not good for ec onom ies"
fixed_text = []
d = enchant.Dict("en_US")
for i in range(len(words := text.split())):
    if fixed_text and not d.check(compound_word := words[i]):
        # Walk backwards through the words added so far, growing the compound
        for j, pending_word in enumerate(fixed_text[::-1], 1):
            if not d.check(pending_word) and d.check(compound_word := ''.join([pending_word, compound_word])):
                del fixed_text[-j:]
                fixed_text.append(compound_word)
                break
        else:
            # No valid compound was found, so keep the fragment as it is
            fixed_text.append(words[i])
    else:
        fixed_text.append(words[i])
print(' '.join(fixed_text))
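The edit to the question mentions keeping every possible option and letting the later match against the clean database discard the wrong ones. A minimal sketch of that idea follows. It is not from the answer above, the name join_candidates is hypothetical, and it needs no dictionary at all.

```python
from itertools import product

def join_candidates(text):
    """Yield every string obtainable by keeping or dropping each single space."""
    words = text.split()
    if not words:
        return
    # For each gap between words, either keep the space or remove it
    for seps in product((" ", ""), repeat=len(words) - 1):
        pieces = [words[0]]
        for sep, word in zip(seps, words[1:]):
            pieces.append(sep + word)
        yield "".join(pieces)

candidates = list(join_candidates("int ernational trade"))
print(candidates)
```

Note that a text with n words produces 2**(n-1) candidates, so this is only workable on short spans, for example a window of a few words around a token the dictionary rejects.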

Replace Items with regex (kind of)

I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".

Assuming the digits are always at the end of the string, have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)  # raw string, so \d is not treated as a string escape
print(s)
it should print:
xyz
Without regex (keep in mind that this solution removes all digits, not only the ones at the end of the string):
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
xyz
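The solutions above assume trailing digits, whereas the question describes a literal caret followed by a digit ("^0", "^1", "^2") that can appear anywhere. A sketch for that reading is below. The function name and the sample string are made up for illustration, and it assumes each marker is exactly one ASCII digit.

```python
import re

def strip_caret_digits(s):
    """Remove every caret followed by a single ASCII digit, e.g. "^0" or "^7"."""
    # The caret must be escaped, otherwise it means "start of string"
    return re.sub(r"\^[0-9]", "", s)

print(strip_caret_digits("^1Player^7 joined ^2team"))  # Player joined team
```

If the markers can have several digits, changing the pattern to r"\^[0-9]+" should cover that case.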

Coloring words based on text list using python

I have two text files: d.txt, containing paragraph text, and phrase.txt, containing multi-word phrases like State-of-the-art, counter productive, Fleet Dynamism, and others like those found at the link below
https://en.wikipedia.org/wiki/List_of_buzzwords
I need to color the font of any phrase in d.txt that is also found in phrase.txt
Efforts so far:
phrases = open("phrase.txt").readlines()
words = open("d.txt").read()
for phrase in phrases:
    all_words_found = False
    phrase_words = phrase.lower().split(" ")
    for word in phrase_words:
        if word in words:
            all_words_found = True
            break
    if all_words_found:
        print (phrase)
Expected output:
Please Help!
thanks for the help:

Update: to create html output
To change the code to create HTML output, add a tag around the word during the replace rather than an ANSI code. The example here uses a simple span tag
words = ["catch phrase", "codeword"]
phrase = "He said a catch phrase. And a codeword was written on a wall."
new_phrase = phrase
for word in words:
    # Wrap each match in a span (the original replaced an undefined name `i`)
    new_phrase = new_phrase.replace(word, f'<span style="color:Red;">{word}</span>')
print(new_phrase)  # Rather than printing, send this wherever you want it.
Solution for inline print
However, to answer your basic question, which is how to replace a set of words in a given paragraph with the same word in a different color, try using .replace() and ANSI color escape codes. This will work if you want to print out the words in your Python environment.
Here's a simple example of a way to turn certain words in a line of text red:
words = ["catch phrase", "codeword"]
phrase = "He said a catch phrase. And a codeword was written on a wall."
new_phrase = phrase
for i in words:
    new_phrase = new_phrase.replace(i, f'\033[91m{i}\033[0;0m')
print(new_phrase)
Here is another Stack Overflow post which talks about ANSI escape codes and colors in Python output: How to print colored text in Python?
ANSI escape codes are a way to change the color of your output - google them to find more options/colors.
In this example here are the codes I used:
First to change the color to red:
\033[91m
After setting the color, you also have to change it back or the rest of the output will also be that color:
\033[0;0m
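The .replace() approach is case-sensitive and would trip over the trailing newlines that readlines() leaves on each phrase. A possible variant using a single regex pass is sketched below. The helper name highlight and its start/end parameters are hypothetical, and the sample phrases stand in for the contents of phrase.txt.

```python
import re

def highlight(text, phrases, start="\033[91m", end="\033[0m"):
    """Wrap each occurrence of any phrase in start/end, ignoring case."""
    # Strip newlines and blanks; put longest phrases first so they win overlaps
    ordered = sorted((p.strip() for p in phrases if p.strip()), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)
    # Keep the matched text as written in the paragraph, only add the wrappers
    return pattern.sub(lambda m: f"{start}{m.group(0)}{end}", text)

phrases = ["catch phrase\n", "Codeword\n"]  # as readlines() would return them
paragraph = "He said a Catch Phrase. And a codeword was written on a wall."
print(highlight(paragraph, phrases))
```

Passing start='<span style="color:Red;">' and end='</span>' should give the HTML flavour with the same function.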

How to make casefold() work on certain Arabic unicodes

I've got some issues with detecting the "equality" in Python 2.7 of some Arabic pairs of words:
أكثر vs اكثر
قائمة vs قائمه
إنشاء vs انشاء
The elements of each pair are not really identical, but they are written with different cases. A useful analogy for me (I don't know any Arabic) is Word vs word. They are not identical, but if I lowercase both of them, I'll obtain word vs word, which are identical. That's what I want to obtain from these 3 pairs of Arabic words.
I'm going to exemplify what I tried so far using the first pair (1. أكثر vs اكثر). By the way, the meaning of both Arabic words from the first pair is "more" (I originally wrote "menu"), but they have different cases (as a parallel: More vs more). I don't know Arabic at all, nor its rules, so if someone who knows Arabic can confirm that those words are "identical" it would be great.
str1 = u'أكثر'
str2 = u'اكثر'
So what I'm trying to do is to bring str1 and str2 to the same form (if possible), so I want a function which produce the same output for both strings:
transform(str1) == transform(str2)
In English this can be achieved easily:
a = u'More'
b = u'more'
def transform(text):
    return text.lower()
>>> transform(a) == transform(b)
True
But, of course, this doesn't work for Arabic as there are no such things like lower case or upper case.
>>> str1
u'\u0623\u0643\u062b\u0631'
>>> str2
u'\u0627\u0643\u062b\u0631'
Note that only the first character differs in the unicode representation.
I also normalized the strings using:
import unicodedata
>>> n_str1 = unicodedata.normalize('NFKD', str1)
>>> n_str2 = unicodedata.normalize('NFKD', str2)
>>> n_str1
u'\u0627\u0654\u0643\u062b\u0631'
>>> n_str2
u'\u0627\u0643\u062b\u0631'
As you already noticed:
>>> n_str1 == n_str2
False
After that, I tried to use unicode.casefold(), but it isn't available in Python 2. I installed the py2casefold library, but I didn't manage to obtain equality between the strings. So I tried Python 3's str.casefold(), but without any luck:
>>> str1.casefold() == str2.casefold()
False
>>> n_str1.casefold() == n_str2.casefold()
False
A solution for this in Python 2 would be perfect, but it would be great in Python 3 too.
Thank you.

These words are not identical: u'أكثر' and u'اكثر' are not the same. The first letter in the first word is the letter Alif with Hamza on top of it; perhaps you didn't notice that because of the small size of the glyph.
The first letter in the second word, however, is a plain Alif (reading from right to left).
And hence they don't compare equal. Each of these letters is represented by its own Unicode code point. They don't compare equal from the perspective of the language either:
>>> u'أكثر'; u'اكثر'
u'\u0623\u0643\u062b\u0631'
u'\u0627\u0643\u062b\u0631'
Quoting the question: They are not identical, but if I lowercase both of them, I'll obtain word vs word, which are identical. That's what I want to obtain from these 3 pairs of Arabic words.
There's no lower or upper case in Arabic. The words that you have are not the same; they have different letters. Some of them are spelled correctly while others are misspelled. They may look the same, and many Arabic readers may well treat them as the same, but for language purists they are not. They still convey the meaning, though; in English, your list of Arabic words would look roughly like this:
1- more, moore
2- menu, manu
3- establish, estblish
Quoting the question: I'm going to exemplify what I tried by now using the first pair (1. أكثر vs اكثر). By the way, the meaning of both Arabic words from the first pair is "menu", but they have different cases (as a parallel: Menu vs menu)
No, أكثر means more. Your second pair means menu, but there's no such thing as Menu vs menu in Arabic. I won't delve into details, because that would be off topic.
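If the goal is loose matching rather than correct spelling, one common workaround is to fold the variant letters onto a single form before comparing. The sketch below is not something the answer above proposes. The mapping is my own assumption, covering only the differences seen in the three pairs, and it is written for Python 3.

```python
# Assumed folding table: Unicode ordinal -> replacement ordinal
LOOSE_MAP = {
    0x0623: 0x0627,  # ALEF WITH HAMZA ABOVE -> ALEF
    0x0625: 0x0627,  # ALEF WITH HAMZA BELOW -> ALEF
    0x0622: 0x0627,  # ALEF WITH MADDA ABOVE -> ALEF
    0x0629: 0x0647,  # TEH MARBUTA -> HEH
}

def transform(text):
    """Fold spelling variants so loosely equivalent words compare equal."""
    return text.translate(LOOSE_MAP)

# The three pairs from the question, written as escapes to avoid look-alikes
pairs = [
    ("\u0623\u0643\u062b\u0631", "\u0627\u0643\u062b\u0631"),
    ("\u0642\u0627\u0626\u0645\u0629", "\u0642\u0627\u0626\u0645\u0647"),
    ("\u0625\u0646\u0634\u0627\u0621", "\u0627\u0646\u0634\u0627\u0621"),
]
for a, b in pairs:
    print(transform(a) == transform(b))
```

This folding is lossy: it can make genuinely different words compare equal, so it suits search-style matching better than anything that must respect spelling. An ordinal-keyed dict like this should also be accepted by unicode.translate in Python 2, though I have not verified that here.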

Why regex is not working?

I need to replace all occurrences of ordinary spaces in «статья 1», «статьи 2» etc. with non-breaking spaces.
The construction below works fine:
re.sub('(стат.{0,4}) (\d+)', r'\1 \2', text) # 'r' in repl is important, otherwise the word is not replaced correctly, at least for texts in Russian.
However, I do not want to call re.sub repeatedly for «статья», then for «пункт», then for the names of months; I want a dictionary of regex patterns and replacements. Here's my code, but it does not work as expected. 'статья 1 статьи 2' should become 'статья(non-breaking space here)1 статьи(non-breaking space here)2':
import re
text = 'статья 1 статьи 2'
dic = {'(cтат.{0,4}) (\d+)' : r'\1 \2'}
def replace():
    global text
    final_text = ''
    for i in dic:
        new_text = re.sub(str(i), str(dic[i]), text)
        text = new_text
    return text
print (replace())

The problem is that you copied and pasted wrong.
This pattern works:
'(стат.{0,4}) (\d+)'
This one doesn't:
'(cтат.{0,4}) (\d+)'
Why? Because in the first one, and in your search string, that first character is a U+0441, a Cyrillic small Es. But in the second one, it's a U+0063, a Latin small C. Of course the two look identical in most fonts, but they're not the same character.
So, how can you tell? Well, when I suspected this problem, here's what I did:
>>> a = '(стат.{0,4}) (\d+)' # copied and pasted from your working code
>>> b = '(cтат.{0,4}) (\d+)' # copied and pasted from your broken code
>>> print(a.encode('unicode-escape').decode('ascii'))
(\u0441\u0442\u0430\u0442.{0,4}) (\\d+)
>>> print(b.encode('unicode-escape').decode('ascii'))
(c\u0442\u0430\u0442.{0,4}) (\\d+)
And the difference is obvious: the first one has a \u0441 escape sequence where the second one has a plain ASCII c.
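To catch this kind of look-alike mix-up without reading escape sequences by eye, one option is to flag any run of letters that mixes scripts. This is only a sketch: the helper name mixed_script_words is hypothetical, and taking the first word of the Unicode character name as the script is a rough heuristic rather than a proper script lookup.

```python
import re
import unicodedata

def mixed_script_words(s):
    """Return letter runs in s whose characters come from more than one script."""
    suspicious = []
    # [^\W\d_]+ matches runs of letters only (word characters minus digits and _)
    for word in re.findall(r"[^\W\d_]+", s):
        scripts = {unicodedata.name(ch).split()[0] for ch in word}
        if len(scripts) > 1:
            suspicious.append((word, sorted(scripts)))
    return suspicious

good = "(\u0441\u0442\u0430\u0442.{0,4}) (\\d+)"  # starts with Cyrillic es
bad = "(c\u0442\u0430\u0442.{0,4}) (\\d+)"         # starts with Latin c

print(mixed_script_words(good))
print(mixed_script_words(bad))
```

The working pattern yields an empty list, while the broken one reports the first word as mixing CYRILLIC and LATIN. The d in \d is a separate letter run, so it is not flagged.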
