A better way to use .replace() in Python - python

I built a pretty basic program that takes input in English, encrypts it using random letters from different alphabets, and also decrypts it:
def encrypt_decrypt():
    inut = input("Text to convert ::-- ")
    # feel free to replace the symbols with your own characters or numbers
    # you can also add numbers and other characters for encryption or decryption
    deciding_variable = input("Do you want to encrypt or decrypt? Write EN or DE ::- ")
    if deciding_variable == "EN":
        deep = inut.replace("a", "ᛟ").replace("b", "ᛃ").replace("c", "Ῡ").replace("d", "ϰ").replace("e", "Г").replace("f", "ξ").replace("g", "ᾫ").replace("h", "ῆ").replace("i", "₪").replace("j", "א").replace("k", "ⴽ").replace("l", "ⵞ").replace("m", "ⵥ").replace("n", "ঙ").replace("o", "Œ").replace("p", "უ").replace("q", "ক").replace("r", "ჶ").replace("s", "Ø").replace("t", "ю").replace("u", "ʧ").replace("v", "ʢ").replace("w", "ұ").replace("x", "Џ").replace("y", "န").replace("z", "໒")
        print(f"\n{deep}\n")
    elif deciding_variable == "DE":
        un_deep = inut.replace("ᛟ", "a").replace("ᛃ", "b").replace("Ῡ", "c").replace("ϰ", "d").replace("Г", "e").replace("ξ","f").replace("ᾫ", "g").replace("ῆ", "h").replace("₪", "i").replace("א", "j").replace("ⴽ", "k").replace("ⵞ", "l").replace("ⵥ", "m").replace("ঙ", "n").replace("Œ", "o").replace("უ", "p").replace("ক", "q").replace("ჶ", "r").replace("Ø", "s").replace("ю", "t").replace("ʧ", "u").replace("ʢ", "v").replace("ұ", "w").replace("Џ", "x").replace("န", "y").replace("໒", "z")
        print(f"\n{un_deep}\n")
encrypt_decrypt()
While writing this I didn't know any better way than chaining the .replace() method, but I have a feeling that this isn't the proper way to do it.
The code works fine, but does anyone know a better way of doing this?

It looks like you are doing a character-by-character replacement. The function you are looking for is str.maketrans. You can give it two strings of equal length to convert each character to the desired one. Here is a working example:
# two strings of equal length
firstString = "abc"
secondString = "def"
string = "abc"
print(string.maketrans(firstString, secondString))
# {97: 100, 98: 101, 99: 102}

# a dictionary can map characters to replacement strings of any length
print(str.maketrans({"a": "d", "b": "ef", "c": "ghi"}))
# {97: 'd', 98: 'ef', 99: 'ghi'}
You can also look at the official documentation for further details.
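To round this out, here is how the table is then applied with str.translate() (a small sketch; the three-letter mapping is just for illustration):

```python
# Build a translation table and apply it with str.translate().
# Each character in the first string maps to the character at the
# same position in the second string.
table = str.maketrans("abc", "def")
print("a cab".translate(table))  # -> d fde
```

Characters not in the table (the space here) pass through unchanged.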

You can make a dictionary of corresponding characters and use this:
text = "ababdba"
translation = {'a':'ᛟ', 'b':'ᛃ', 'c':'Ῡ','d': 'ϰ','e': 'Г','f': 'ξ','g': 'ᾫ','h':'ῆ','i': '₪','j': 'א','k': 'ⴽ','l': 'ⵞ','m' :'ⵥ','n': 'ঙ','o': 'Œ','p': 'უ','q': 'ক','r': 'ჶ','s': 'Ø','t': 'ю','u': 'ʧ', 'v':'ʢ','w': 'ұ','x': 'Џ','y': 'န','z': '໒'}
def translate(text, translation):
    result = []
    for char in text:
        result.append(translation[char])
    return "".join(result)

print(translate(text, translation))
The result is:
ᛟᛃᛟᛃϰᛃᛟ
This might help you.
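One caveat: translation[char] raises KeyError for any character not in the mapping (spaces, punctuation). A variant using dict.get falls back to the original character; this is a sketch with a deliberately tiny mapping:

```python
def translate(text, translation):
    # get(char, char) returns the mapped character if present,
    # otherwise the character itself, so spaces and punctuation survive.
    return "".join(translation.get(char, char) for char in text)

print(translate("ab da!", {"a": "ᛟ", "b": "ᛃ", "d": "ϰ"}))  # -> ᛟᛃ ϰᛟ!
```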

str.translate() and str.maketrans() are built to do all of the replacements in one go.
e.g.
>>> encrypt_table = str.maketrans("abc", "ᛟᛃῩ")
>>> "an abacus".translate(encrypt_table)
'ᛟn ᛟᛃᛟῩus'
NB: not string.maketrans(), which is how it was done in Python 2 and is now outdated; Python 3 split it into two functions, str.maketrans() for text and bytes.maketrans() for bytes. See How come string.maketrans does not work in Python 3.1?
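Applied to the original question, you only need to write the mapping once and can build both tables from it. A minimal round-trip sketch (using just two letters of the cipher alphabet for brevity; extend both strings with the full alphabet):

```python
plain = "ab"
cipher = "ᛟᛃ"  # same length as plain; extend both together

encrypt_table = str.maketrans(plain, cipher)
decrypt_table = str.maketrans(cipher, plain)

secret = "a bad cab".translate(encrypt_table)
print(secret)                           # only 'a' and 'b' are substituted here
print(secret.translate(decrypt_table))  # round-trips back to "a bad cab"
```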

Related

How to cast different values to an alphabet

I'm a noob student working on a computer vision project.
I'm using the googletrans library to translate characters extracted with Tesseract OCR. The extracted text is in the Sinhala language, and I need to transform it into English letters without translating the meaning. But googletrans translates the meaning as well; I only need the letters converted. So I want to create a mapping file of characters in the Sinhala language with corresponding values in English. I can't figure out where to start or what needs to be done. I tried to find online resources but, due to my lack of knowledge, I cannot connect the dots. Please guide me through this.
Here is a sample of how it should be.
(Sinhala letter) (English letters)
ට = ta
ක = ka
ර = ra
ම = ma
I think you should map all the characters using a dictionary, like so:
characters_map = {
    'ට': 'ta',
    'ක': 'ka',
    'ර': 'ra',
    'ම': 'ma'
}
Then you should loop through your text, like so:
for letter in text:
    try:
        text = text.replace(letter, characters_map[letter])
    except KeyError:
        pass  # if a letter is not recognized, it is left as is
As pointed out by OneMadGypsy, this overwrites the initial text, so this might be a better practice:
replaced_text = ''
for letter in text:
    try:
        replaced_text += characters_map[letter]
    except KeyError:
        replaced_text += letter
This replaces all occurrences of each letter with the corresponding value in your dictionary.
I hope this helps, and good luck!
More links :
replace()
loop through string
As @LouisAT stated, a dict is probably the best way to go, but I disagree with the rest of their implementation.
You could create your own str type that fully inherits from str but adds your phonetics and transliteration properties.
class sin_str(str):
    @property
    def __phonetics(self) -> dict:
        return {'ටු': 'tu',
                'කා': 'kaa',
                'ට': 'ta',
                'ක': 'ka',
                'ර': 'ra',
                'ම': 'ma'}

    @property
    def transliteration(self) -> str:
        p, t = self.__phonetics, self[:]
        for k, v in p.items():
            t = t.replace(k, v)
        return t

# use
text = sin_str('කා')
print(text.transliteration)  # kaa
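One thing to watch with this kind of loop: multi-character keys such as 'ටු' must be replaced before their single-character prefix 'ට', or the output is mangled. Sorting the items by key length (longest first) makes that explicit rather than dependent on dictionary order; a small standalone sketch:

```python
def transliterate(text, phonetics):
    # Replace longer keys first so a two-character sequence such as
    # 'ටු' wins over its single-character prefix 'ට'.
    for k, v in sorted(phonetics.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(k, v)
    return text

print(transliterate("ටුම", {"ට": "ta", "ටු": "tu", "ම": "ma"}))  # -> tuma
```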

How do I use string.replace() to replace only when the string matches exactly?

I have a dataframe with a list of poorly spelled clothing types. I want them all in the same format. For example, I have "trous", "trouse" and "trousers", and I would like to replace the first two with "trousers".
I have tried using string.replace, and it takes the first "trous" and changes it to "trousers" as it should, and "trouse" works also, but when it gets to "trousers" it makes "trousersersers"! I think it is matching every string that contains trous, trouse, or trousers and changing them all.
Is there a way I can limit string.replace to look for exactly "trous"?
Here's what I've tried so far. As you can see I have a good few changes to make; most of them work OK, but it's the likes of trousers and t-shirts, which have a few similar changes to be made, that are causing the upset.
newTypes = []
for string in types:
    underwear = string.replace(('UNDERW'), 'UNDERWEAR').replace('HANKY', 'HANKIES').replace('TIECLI', 'TIECLIPS').replace('FRAGRA', 'FRAGRANCES').replace('ROBE', 'ROBES').replace('CUFFLI', 'CUFFLINKS').replace('WALLET', 'WALLETS').replace('GIFTSE', 'GIFTSETS').replace('SUNGLA', 'SUNGLASSES').replace('SCARVE', 'SCARVES').replace('TROUSE ', 'TROUSERS').replace('SHIRT', 'SHIRTS').replace('CHINO', 'CHINOS').replace('JACKET', 'JACKETS').replace('KNIT', 'KNITWEAR').replace('POLO', 'POLOS').replace('SWEAT', 'SWEATERS').replace('TEES', 'T-SHIRTS').replace('TSHIRT', 'T-SHIRTS').replace('SHORT', 'SHORTS').replace('ZIP', 'ZIP-TOPS').replace('GILET ', 'GILETS').replace('HOODIE', 'HOODIES').replace('HOODZIP', 'HOODIES').replace('JOGGER', 'JOGGERS').replace('JUMP', 'SWEATERS').replace('SWESHI', 'SWEATERS').replace('BLAZE ', 'BLAZERS').replace('BLAZER ', 'BLAZERS').replace('WC', 'WAISTCOATS').replace('TTOP', 'T-SHIRTS').replace('TROUS', 'TROUSERS').replace('COAT', 'COATS').replace('SLIPPE', 'SLIPPERS').replace('TRAINE', 'TRAINERS').replace('DECK', 'SHOES').replace('FLIP', 'SLIDERS').replace('SUIT', 'SUITS').replace('GIFTVO', 'GIFTVOUCHERS')
    newTypes.append(underwear)
types = newTypes
Assuming you're okay with not using string.replace(), you can simply do this:
lst = ["trousers", "trous", "trouse"]
for i in range(len(lst)):
    if "trous" in lst[i]:
        lst[i] = "trousers"
print(lst)
# Prints ['trousers', 'trousers', 'trousers']
This checks if the shortest substring, trous, is part of the string, and if so converts the entire string to trousers.
Use a dict for the strings to be replaced:
d = {
    'trous': 'trousers',
    'trouse': 'trousers',
    # ...
}
newtypes = [d.get(string, string) for string in types]
d.get(string, string) returns string unchanged if string is not a key in d, so only exact matches are replaced.

R Regex, get string between quotations marks

So, I'm trying to extract "Document is original" from the string below.
c:1:{s:7:"note";s:335:"Document is original-no need to register again";}
Two thoughts:
A little bit of work, get most components of that structure:
string <- 'c:1:{s:7:"note";s:335:"Document is original-no need to register again";}'
strcapture("(.*):(.*):(.*)",
           strsplit(regmatches(string, gregexpr('(?<={)[^}]+(?=})', string, perl = TRUE))[[1]], ";")[[1]],
           proto = list(s="", len=1L, x=""))
# s len x
# 1 s 7 "note"
# 2 s 335 "Document is original-no need to register again"
A simpler approach, perhaps a little more hard-coded:
regmatches(string, gregexpr('(?<=")([^;"]+)(?=")', string, perl = TRUE))[[1]]
# [1] "note"
# [2] "Document is original-no need to register again"
From here, you need to figure out how to dismiss "note" and then perhaps strsplit(.., "-") to get the substring you want.

How to replace several words using python if order may change?

I want to create a little homemade translation tool where only a specific list of sentences is translated.
I have learnt to use the replace() method, but my main problem is that I am translating from English to Spanish, so two problems appear:
- the order reverses many times
- sometimes a group of words is translated as just one word, and sometimes a single word has to be translated as two or more
I know how to translate word by word, but that is not enough for this problem. In this particular case I guess I have to translate whole chunks of words. How could I do that?
Here is what I can do so far: I define two lists; in the first one I put the original English words to be translated, and in the other the corresponding Spanish words.
Then I get the input text, split it, and using two for loops I check if any of the words are present. If they are, I use replace to change them to the Spanish version.
After that I use the join method, adding a space between words, to get the final result.
a = (["Is", "this", "the", "most", "violent", "show"])
b = (["Es", "este", "el", "más", "violento", "show"])
text = "Is this the most violent show?"
text2 = text.split()
for i in range(len(a)):
    for j in range(len(text2)):
        if a[i] == text2[j]:
            text2[j] = b[i]
print("Final text is: ", " ".join(text2))
The output is:
Final text is: Es este el más violento show?
The result is in the wrong order, since "más violento show" sounds weird in Spanish; it should instead be "show más violento".
What I want to learn is how to put chunks of words in the list, like this:
a = (["most violent show"])
b = (["show más violento"])
But in that case I can't use the split tool, and I am a bit lost on how to do this.
What about a simpler solution using replace and a mapping:
t = {'dd': 'aa', 'eee': 'bbb', 'f f f': 'c c c'}
v = 'dd eee zz f f f'
output = v
for a, b in t.items():
    output = output.replace(a, b)
print(output)
# 'aa bbb zz c c c'
This is actually a fairly complicated problem (if you allow it to be)! As of writing, some other answers are perfectly fine for this particular example, so if they work, please mark one of those as the accepted answer.
First off, you should use dictionaries for this. They are a "dictionary" where you look something up (the key) and get a definition (the value).
The difficult part is being able to match parts of the input phrase to-be-translated in order to get a translated output. Our general algorithm: go through every single one of the English key words/phrases and then translate them to Spanish.
There are a few problems:
You will be translating as-you-go, meaning if your translation contains words that could be both English and Spanish, you can run into nonsense translations.
English key words might be character subsets of other key terms, e.g.: "most" -> "más", "most violent show" -> "show más violento".
You need to match case sensitivity.
I won't bother with 3 as it's not really in scope of the question and will take too long. Solving 2 is easiest: when reading the keys of the dictionary, order by length of the input key. Solving 1 is much harder: you need to know which terms have already been translated when looking at the "translation in progress."
So a complex but thorough solution for this is outlined below:
translation_dict = {
    "is": "es",
    "this": "este",
    "the": "el",
    "most violent show": "show más violento",
}

input_phrase = "Is this the most violent show?"
translations = list()

# Force the translation to be lower-case.
input_phrase = input_phrase.lower()
for key in sorted(translation_dict.keys(), key=lambda phrase: -len(phrase)):
    spanish_translation = translation_dict[key]
    # Code will assume all keys are lower-case.
    if key in input_phrase:
        input_phrase = input_phrase.replace(key, "{{{}}}".format(len(translations)))
        translations.append(spanish_translation)

print(input_phrase.format(*translations))
There are yet more complex solutions if you know the max word size for a translation (i.e.: iterating n-grams where n <= m, and m is the largest group of words you expect to translate). You would iterate the n-gram for largest m first, attempting to search your translation dictionary, and decrementing n by 1 until you go through individual words to iterate.
For example, with m = 3 with input: "This is a test string.", you would get the following English phrases that you would attempt to translate.
"This is a"
"is a test"
"a test string"
"this is"
"is a"
"a test"
"test string"
"this"
"is"
"a"
"test"
"string"
This can have a performance benefit with a huge translation dictionary. I would show it but this answer is complex enough as it is.
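For the curious, the n-gram enumeration described above can be sketched like this (a toy illustration with m = 3; a real translator would look each candidate up in the dictionary and skip spans it has already translated):

```python
def ngrams_longest_first(text, m):
    # Yield every run of n consecutive words, for n = m down to 1,
    # so multi-word phrases are tried before single words.
    words = text.lower().rstrip(".?!").split()
    for n in range(m, 0, -1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

for phrase in ngrams_longest_first("This is a test string.", 3):
    print(phrase)
```

This prints exactly the twelve candidate phrases listed above.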
I think you can achieve what you are looking for with the string replace method:
a = ("Is", "this", "the", "most violent show")
b = ("Es", "este", "el", "show más violento")
text = "Is this the most violent show?"
for val, elem in enumerate(a):
    text = text.replace(elem, b[val])
print(text)
# Es este el show más violento?
Also note you have a list inside a tuple, which is redundant.
Note that Caspar Wylie's solution, using dicts instead, is a neater method.

Python String Cleanup + Manipulation (Accented Characters)

I have a database full of names like:
John Smith
Scott J. Holmes
Dr. Kaplan
Ray's Dog
Levi's
Adrian O'Brien
Perry Sean Smyre
Carie Burchfield-Thompson
Björn Árnason
There are a few foreign names with accents in them that need to be converted to strings with non-accented characters.
I'd like to convert the full names (after stripping characters like " ' " , "-") to user logins like:
john.smith
scott.j.holmes
dr.kaplan
rays.dog
levis
adrian.obrien
perry.sean.smyre
carie.burchfieldthompson
bjorn.arnason
So far I have:
Fullname = Fullname.strip()  # get rid of leading/trailing white space
Fullname = Fullname.lower()  # make everything lower case
...  # after bad chars converted/removed
Fullname = Fullname.replace(' ', '.')  # replace spaces with periods
Take a look at this link [redacted]
Here is the code from the page (note that it is Python 2 code):
def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {
        0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
    }
    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# It creates a string with all the characters in the latin-1 character set,
# then converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('', 'latin-1')
    for c in range(32, 256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
If you are not afraid to install third-party modules, then have a look at the Python port of the Perl module Text::Unidecode (it's also on PyPI).
The module does nothing more than use a lookup table to transliterate the characters. I glanced over the code and it looks very simple, so I suppose it works on pretty much any OS and on any Python version (crossing fingers). It's also easy to bundle with your application.
With this module you don't have to create your lookup table manually (= reduced risk of it being incomplete).
The advantage of this module compared to the Unicode normalization technique is this: Unicode normalization does not replace all characters. A good example is a character like "æ". Unicode normalization will see it as "Letter, lowercase" (Ll). This means using the normalize method will give you neither a replacement character nor a useful hint. Unfortunately, that character is not representable in ASCII, so you'll get errors.
The mentioned module does a better job at this: it will actually replace "æ" with "ae", which is actually useful and makes sense.
The most impressive thing I've seen is that it goes much further. It even replaces Japanese kana characters mostly properly. For example, it replaces "は" with "ha", which is perfectly fine. It's not fool-proof though, as the current version replaces "ち" with "ti" instead of "chi", so you'll have to handle it with care for the more exotic characters.
Usage of the module is straightforward (Python 2 syntax here):
from unidecode import unidecode
var_utf8 = "æは".decode("utf8")
unidecode( var_utf8 ).encode("ascii")
>>> "aeha"
Note that I have nothing to do with this module directly. It just happens that I find it very useful.
Edit: The patch I submitted fixed the bug concerning the Japanese kana. I've only fixed the ones I could spot right away; I may have missed some.
The following function is generic:
import unicodedata

def not_combining(char):
    return unicodedata.category(char) != 'Mn'

def strip_accents(text, encoding):
    unicode_text = unicodedata.normalize('NFD', text.decode(encoding))
    return filter(not_combining, unicode_text).encode(encoding)
# in a cp1252 environment
>>> print strip_accents("déjà", "cp1252")
deja
# in a cp1253 environment
>>> print strip_accents("καλημέρα", "cp1253")
καλημερα
Obviously, you should know the encoding of your strings.
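For reference, in Python 3 (where all strings are already Unicode) the same NFD idea no longer needs the encode/decode dance; a rough equivalent would be:

```python
import unicodedata

def strip_accents(text):
    # Decompose each accented character into base letter + combining mark,
    # then drop the combining marks (Unicode category 'Mn').
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("déjà"))      # -> deja
print(strip_accents("καλημέρα"))  # -> καλημερα
```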
I would do something like this (Python 2):
# coding=utf-8
import re

def alnum_dot(name, replace={}):
    for k, v in replace.items():
        name = name.replace(k, v)
    return re.sub("[^a-z.]", "", name.strip().lower())

print alnum_dot(u"Frédrik Holmström", {
    u"ö": "o",
    " ": "."
})
The second argument is a dict of the characters you want replaced; all characters other than a-z and "." that are not replaced will be stripped.
The translate method allows you to delete characters; you can use that to delete arbitrary characters. In Python 2:
Fullname.translate(None, "'-\"")
If you want to delete whole classes of characters, you might want to use the re module.
re.sub('[^a-z0-9 ]', '', Fullname.strip().lower())
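For Python 3 readers: the two-argument translate(None, chars) form above is Python 2 only. In Python 3 the set of characters to delete goes in the third argument of str.maketrans; a quick sketch:

```python
# The third maketrans argument lists characters to delete outright.
table = str.maketrans("", "", "'-\"")
print("O'Brien-Smith".translate(table).lower())  # -> obriensmith
```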
