I tried using w = Word(printables), but it isn't working. How should I write the spec for this? 'w' is meant to process Hindi characters (UTF-8).
The code specifies the grammar and parses accordingly.
671.assess :: अहसास ::2
x=number + "." + src + "::" + w + "::" + number + "." + number
When the input contains only English characters it works, so the code is correct for ASCII input, but it fails on Unicode input.
I mean that the code works when we have something of the form
671.assess :: ahsaas ::2
i.e. it parses words written in ASCII, but I am not sure how to parse and then print characters in Unicode. I need this for English-Hindi word alignment.
The python code looks like this:
# -*- coding: utf-8 -*-
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit
# grammar
src = Word(printables)
trans = Word(printables)
number = Word(nums)
x=number + "." + src + "::" + trans + "::" + number + "." + number
#parsing for eng-dict
efiledata = open('b1aop_or_not_word.txt').read()
eresults = x.parseString(efiledata)
edict1 = {}
edict2 = {}
counter=0
xx=list()
for result in eresults:
    trans = ""  # translation string
    ew = ""     # english word
    xx = result[0]
    ew = xx[2]
    trans = xx[4]
    edict1 = {ew: trans}
    edict2.update(edict1)
print len(edict2) #no of entries in the english dictionary
print "edict2 has been created"
print "english dictionary" , edict2
#parsing for hin-dict
hfiledata = open('b1aop_or_not_word.txt').read()
hresults = x.scanString(hfiledata)
hdict1 = {}
hdict2 = {}
counter=0
for result in hresults:
    trans = ""  # translation string
    hw = ""     # hindi word
    xx = result[0]
    hw = xx[2]
    trans = xx[4]
    # print trans
    hdict1 = {trans: hw}
    hdict2.update(hdict1)
print len(hdict2) #no of entries in the hindi dictionary
print "hdict2 has been created"
print "hindi dictionary" , hdict2
'''
#######################################################################################################################
def translate(d, ow, hinlist):
    if ow in d.keys():  # ow = old word, d = dict
        print ow, "exists in the dictionary keys"
        transes = d[ow]
        transes = transes.split()
        print "possible transes for", ow, " = ", transes
        for word in transes:
            if word in hinlist:
                print "trans for", ow, " = ", word
                return word
        return None
    else:
        print ow, "absent"
        return None
f = open('bidir','w')
#lines = ["'\
#5# 10 # and better performance in business in turn benefits consumers . # 0 0 0 0 0 0 0 0 0 0 \
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI . # 0 0 0 0 0 0 0 0 0 0 0 \
#'"]
data=open('bi_full_2','rb').read()
lines = data.split('!##$%')
loc=0
for line in lines:
    eng, hin = [subline.split(' # ')
                for subline in line.strip('\n').split('\n')]
    for transdict, source, dest in [(edict2, eng, hin),
                                    (hdict2, hin, eng)]:
        sourcethings = source[2].split()
        for word in source[1].split():
            tl = dest[1].split()
            otherword = translate(transdict, word, tl)
            loc = source[1].split().index(word)
            if otherword is not None:
                otherword = otherword.strip()
                print word, ' <-> ', otherword, 'meaning=good'
                if otherword in dest[1].split():
                    print word, ' <-> ', otherword, 'trans=good'
                    sourcethings[loc] = str(
                        dest[1].split().index(otherword) + 1)
        source[2] = ' '.join(sourcethings)
    eng = ' # '.join(eng)
    hin = ' # '.join(hin)
    f.write(eng+'\n'+hin+'\n\n\n')
f.close()
'''
If an example input sentence for the source file is:
1# 5 # modern markets : confident consumers # 0 0 0 0 0
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 0 0 0 0 0 0
!##$%
the output would look like this:
1# 5 # modern markets : confident consumers # 1 2 3 4 5
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa . # 1 2 3 4 5 0
!##$%
Output Explanation:-
This achieves bidirectional alignment.
It means the first word of English, 'modern', maps to the first word of Hindi, 'AddhUnIk', and vice versa. Here even punctuation characters are taken as words, since they too are an integral part of the bidirectional mapping. Thus, if you look at the Hindi word '.', it has a null alignment: it maps to nothing in the English sentence, which has no full stop.
The third line in the output is simply a delimiter, used when processing a number of sentences for which you're trying to achieve bidirectional mapping.
What modification should I make for this to work when the Hindi sentences are in Unicode (UTF-8) format?
Pyparsing's printables only deals with strings in the ASCII range of characters. You want printables in the full Unicode range, like this:
unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
if not unichr(c).isspace())
Now you can define trans using this more complete set of non-space characters:
trans = Word(unicodePrintables)
I was unable to test against your Hindi test string, but I think this will do the trick.
If you are using Python 3, there is no separate unichr function and no xrange generator; just use:
unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                            if not chr(c).isspace())
EDIT:
With the recent release of pyparsing 2.3.0, new namespace classes have been defined to give printables, alphas, nums, and alphanums for various Unicode language ranges.
import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
As a general rule, do not process encoded bytestrings: make them into proper unicode strings (by calling their .decode method) as soon as possible, do all of your processing always on unicode strings, then, if you have to for I/O purposes, .encode them back into whatever bytestring encoding you require.
If you're talking about literals, as it seems you are in your code, the "as soon as possible" is at once: use u'...' to express your literals. In a more general case, where you're forced to do I/O in encoded form, it's immediately after input (just as it's immediately before output if you need to perform output in a specific encoded form).
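In Python 3 terms, that boundary discipline looks like this (a sketch; in-memory bytes stand in here for a file's contents):

```python
# Decode early, process as str, encode late; the sample line is from the question.
raw = "671.assess :: अहसास ::2".encode("utf-8")  # bytes as they would come off disk
text = raw.decode("utf-8")    # decode immediately after input
# ... all parsing and dictionary building happens on `text` (a str) ...
out = text.encode("utf-8")    # encode only at the output boundary
assert out == raw
```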
I was searching for French Unicode characters and came across this question. If you need French or other Latin accents, with pyparsing 2.3.0 you can use:
>>> pp.pyparsing_unicode.Latin1.alphas
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
Related
I am trying to translate a paragraph from English to my local language, for which I have written this code:
def translate(inputvalue):
    # inputvalue is an array of English paragraphs
    try:
        translatedData = []
        trans = Translator()
        for i in inputvalue:
            # add a space where one is missing after '.' or ','
            sentence = re.sub(r'(?<=[.,])(?=[^\s])', r' ', i)
            # translate from English to my local language, Urdu
            t = trans.translate(sentence, src='en', dest='ur')
            translatedData.append(t.text)
        # finally, call DisplayOutput to print the translated data
        DisplayOutput.output(translatedData)
    except Exception:
        pass  # (error handling trimmed in the question)
The problem I am facing is that my local language is written from the right side, and googletrans is not giving proper output. It puts periods, commas, and untranslated words at the beginning or at the end. For example:
I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD.
it would translate this sentence as:
میری عمر 6 سال ہے،. مجھے کارٹون جانور اور پودے کھینچنا پسند ہےمجھے ADHD 6نہیں ہے.
As you can observe, it could not translate ADHD, since it is just an abbreviation; it puts it at the beginning of the sentence, and the same happens with periods, numbers, and commas.
How should I translate it so that it does not get scrambled like that?
What if I put the sentence in an array like:
['I am', '6', 'years old', '.', 'I love to draw cartoons',',', 'animals',',', 'and plants','.', 'I do not have', 'ADHD','.']
I have no idea how to build this type of array, but I believe it would solve the problem, since I could then translate only the parts that contain English words and join the list back into a string.
Kindly help me generate this type of array, or suggest any other solution.
string = "I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD."
arr = []
substring = ""
for char in string:
    # is this character part of a "word" chunk (letters and spaces)?
    alpha = char.isalpha() or char == " "
    if substring.replace(" ", "").isalpha():
        if alpha:
            substring += char
        else:
            arr.append(substring)
            substring = char
    else:
        if alpha:
            arr.append(substring)
            substring = char
        else:
            substring += char
arr.append(substring)  # don't drop the final chunk
while " " in arr: arr.remove(" ")
while "" in arr: arr.remove("")
print(arr)
Loop through each character in the string and check whether it is a letter with .isalpha(). Then, depending on the contents of the current substring, either append the character to it or start a new one.
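For what it's worth, the same kind of split can be sketched with a regex in one pass: runs of letters and spaces stay together, while every other character becomes its own token (a sketch, not a drop-in replacement for the loop above):

```python
import re

def tokenize(s):
    # runs of letters/spaces vs. single non-letter characters
    parts = re.findall(r"[A-Za-z ]+|[^A-Za-z ]", s)
    return [p.strip() for p in parts if p.strip()]

print(tokenize("I am 6 years old."))  # -> ['I am', '6', 'years old', '.']
```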
I have two things that I would like to replace in my text files:
Add spaces between the letters of any string ending with '#' (e.g. ABC# becomes A B C).
Leave strings ending with 'H' or matching 'xx:xx:xx' untouched (e.g. 1111H is ignored), but spell out standalone digits (e.g. 1111 becomes 'ONE ONE ONE ONE').
so far this is my code..
import os
import re

dest1 = r"C:\Users\CL\Desktop\Folder"
files = os.listdir(dest1)
# dictionary to spell digits out as words
numbers = {"0":"ZERO ", "1":"ONE ", "2":"TWO ", "3":"THREE ", "4":"FOUR ", "5":"FIVE ", "6":"SIX ", "7":"SEVEN ", "8":"EIGHT ", "9":"NINE "}
for f in files:
    with open(dest1 + "\\" + f, "r") as infile:
        text_read = infile.read()
    # digit substitution pattern
    text_read = re.sub('[%s]\s?' % ''.join(numbers), lambda x: numbers[x.group().strip()] + ' ', text_read)
    # write the result back to the file
    with open(dest1 + "\\" + f, "w") as outfile:
        outfile.write(text_read)
sample .txt
1111H I have 11 ABC# apples
11:12:00 I went to my# room
output required
1111H I have ONE ONE A B C apples
11:12:00 I went to M Y room
Also, I realized that when I write the new result, the formatting gets 'messy' and loses the line breaks. I'm not sure why.
#current output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES
ONE ONE ONE TWO H - I WENT TO MY# ROOM
#overwritten output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES ONE ONE ONE TWO H - I WENT TO MY# ROOM
You can use
def process_match(x):
    if x.group(1):
        return " ".join(x.group(1).upper())
    elif x.group(2):
        return numbers[x.group(2)]
    else:
        return x.group()

print(re.sub(r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])', process_match, text_read))
# => 1111H I have ONE ONE A B C apples
# 11:12:00 I went to M Y room
See the regex demo. The main idea behind this approach is to parse the string only once, capturing (or not capturing) parts of it, and to process each match on the fly: either returning it as is (if nothing was captured) or returning the converted text (if a group was captured).
Regex details:
\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b - a word boundary, and then either one or more digits and one or more uppercase letters, or three occurrences of colon-separated double digits, and then a word boundary
| - or
\b([A-Za-z]+)# - Group 1: words with # at the end: a word boundary, then one or more letters, and a #
| - or
([0-9]) - Group 2: an ASCII digit.
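Putting the pattern and the callback together into a self-contained snippet, with the question's numbers dict and the sample text inlined:

```python
import re

# digit-to-word dictionary from the question (note the trailing spaces in the values)
numbers = {"0": "ZERO ", "1": "ONE ", "2": "TWO ", "3": "THREE ", "4": "FOUR ",
           "5": "FIVE ", "6": "SIX ", "7": "SEVEN ", "8": "EIGHT ", "9": "NINE "}

def process_match(m):
    if m.group(1):                    # word ending in '#': space out its letters
        return " ".join(m.group(1).upper())
    if m.group(2):                    # lone digit: spell it out
        return numbers[m.group(2)]
    return m.group()                  # protected token (1111H, 11:12:00): unchanged

text_read = "1111H I have 11 ABC# apples\n11:12:00 I went to my# room"
result = re.sub(r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])',
                process_match, text_read)
print(result)
```

The trailing spaces in the dict values leave some doubled spaces in the raw output, which is harmless once whitespace is normalized.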
I just wanted to know if there's a simple way in Python to count how many characters of one string coincide, in order, with another string, or if anyone knows how it could be done. To make myself clear, I'll give an example.
text_sample = "baguette is a french word"
words_to_match = ("baguete","wrd")
letters_to_match = ('b','a','g','u','t','e','w','r','d') # With just one 'e'
coincidences = sum(text_sample.count(x) for x in letters_to_match)
# coincidences = 14 Current output
# coincidences = 10 Expected output
My current method breaks words_to_match into single characters, as in letters_to_match, and then counts every occurrence of each letter in "baguette is a french word", giving coincidences = 14.
But I want to obtain coincidences = 10, counting only the characters that match in sequence, by checking the similarity between words_to_match and the words in text_sample.
How do I get my expected output?
It looks like you need the length of the longest common subsequence (LCS). See the algorithm in the Wikipedia article for computing it. You may also be able to find a C extension that computes it quickly; for example, a search turns up pylcs, among others. After installation (pip install pylcs):
import pylcs
text_sample = "baguette is a french word"
words_to_match = ("baguete","wrd")
print(pylcs.lcs2(text_sample, ' '.join(words_to_match)))  # prints the match length
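If you prefer to avoid the C extension, the classic dynamic-programming LCS from the Wikipedia article is short enough to write directly. Note that joining the words with a space lets the LCS match that space too (11 here); joining without a space gives the question's expected 10:

```python
def lcs_len(a, b):
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

text_sample = "baguette is a french word"
print(lcs_len(text_sample, "baguete wrd"))   # -> 11 (the space is matched too)
print(lcs_len(text_sample, "baguetewrd"))    # -> 10
```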
First, flatten words_to_match into a tuple of letters:
words = ''
for item in words_to_match:
    words += item

letters = []  # create a list
for letter in words:
    letters.append(letter)
letters = tuple(letters)
Then walk through the text and count the letters that match in sequence:
x = 0
coincidence = 0
for i in text_sample:
    if x < len(letters) and letters[x] == i:
        x += 1
        coincidence += 1
Also, if order doesn't matter, just do:
for i in text_sample:
    if i in letters:
        coincidence += 1
(Note that in some versions of Python you'll need a newline.)
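Putting the in-sequence version together into one runnable snippet (variable names follow the question) gives exactly the expected count:

```python
text_sample = "baguette is a french word"
words_to_match = ("baguete", "wrd")

letters = tuple("".join(words_to_match))  # ('b', 'a', 'g', ..., 'd')
x = 0
coincidence = 0
for ch in text_sample:
    # advance through `letters` only when the next expected letter appears
    if x < len(letters) and letters[x] == ch:
        x += 1
        coincidence += 1
print(coincidence)  # -> 10
```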
I'm using Python 3.7
To view the Voynich Manuscript:
To see my work check out Voynichman Forum:
Voynichman
This video also explains my work with the Voynich Manuscript:
https://www.youtube.com/watch?v=Wo2ER1Zs78U
Jason Davies Voynich Manuscript
My premise is that Wilfrid Voynich constructed the Voynich Manuscript some time in 1910.
This is a little complicated to describe, so bear with me. I wish to take any dot-dash input in Morse code (which does not necessarily have to represent a letter in Morse code) and output Italian words only. I want the code to find the letters for me and then put them together when they form a recognizable Italian word. I already have Python code that takes normal Morse code input and just outputs anagrams in any language. I'm not sure whether anyone helping me here needs to use an anagram engine. I want this code so I can fully decode the Voynich Manuscript.
Here is a sample narrative from Voynich to Morse to Italian translated to English:
The cipher above retains the glyph relationships to the dot and dash totals which is used to build an Italian word. Here is an example. I have to admit there are some English words.
print("Author Thomas O'Neil, copyright ver 0.1, VMS Italian Steganography "
      "Morse Code to Anagrams, August 8, 2019")
# Python program to implement Morse Code Translator
'''
VARIABLE KEY
'cipher' -> 'stores the morse translated form of the english string'
'decipher' -> 'stores the english translated form of the morse string'
'citext' -> 'stores morse code of a single character'
'i' -> 'keeps count of the spaces between morse characters'
'message' -> 'stores the string to be encoded or decoded'
'''
# Dictionary representing the morse code chart
MORSE_CODE_DICT = { 'A':'.-', 'B':'-...',
'C':'-.-.', 'D':'-..', 'E':'.',
'F':'..-.', 'G':'--.', 'H':'....',
'I':'..', 'J':'.---', 'K':'-.-',
'L':'.-..', 'M':'--', 'N':'-.',
'O':'---', 'P':'.--.', 'Q':'--.-',
'R':'.-.', 'S':'...', 'T':'-',
'U':'..-', 'V':'...-', 'W':'.--',
'X':'-..-', 'Y':'-.--', 'Z':'--..',
'1':'.----', '2':'..---', '3':'...--',
'4':'....-', '5':'.....', '6':'-....',
'7':'--...', '8':'---..', '9':'----.',
'0':'-----', ', ':'--..--', '.':'.-.-.-',
'?':'..--..', '/':'-..-.', '-':'-....-',
'(':'-.--.', ')':'-.--.-',}
# Function to encrypt the string
# according to the morse code chart
def encrypt(message):
    cipher = ''
    for letter in message:
        if letter != ' ':
            # Looks up the dictionary and adds the
            # corresponding morse code,
            # along with a space to separate
            # morse codes for different characters
            cipher += MORSE_CODE_DICT[letter] + ' '
        else:
            # 1 space indicates different characters
            # and 2 indicates different words
            cipher += ' '
    return cipher
# Function to decrypt the string
# from morse to english
def decrypt(message):
    # extra space added at the end to access the
    # last morse code
    message += ' '
    decipher = ''
    citext = ''
    for letter in message:
        # checks for space
        if letter != ' ':
            # counter to keep track of spaces
            i = 0
            # storing morse code of a single character
            citext += letter
        # in case of space
        else:
            # if i = 1 that indicates a new character
            i += 1
            # if i = 2 that indicates a new word
            if i == 2:
                # adding space to separate words
                decipher += ' '
            else:
                # accessing the keys using their values (reverse of encryption)
                decipher += list(MORSE_CODE_DICT.keys())[
                    list(MORSE_CODE_DICT.values()).index(citext)]
                citext = ''
    return decipher
def anagrams(word):
    """Generate all of the anagrams of a word."""
    if len(word) < 2:
        yield word
    else:
        for i, letter in enumerate(word):
            if letter not in word[:i]:  # avoid duplicating earlier words
                for j in anagrams(word[:i] + word[i+1:]):
                    yield j + letter
# Driver function to run the program
def main():
    message = input("Type in Morse Code to output anagrams!: ")
    result = decrypt(message)
    print(result)
    return result

# Executes the main function in a loop
if __name__ == '__main__':
    while True:
        for i in anagrams(main()):
            print(i)
Get a file of all Italian words, load the file into a Python set, and filter output words by checking whether they appear in the set.
For example, assuming you have a file italian-words.txt with one word on each line:
italian_words = set()
with open('italian-words.txt') as f:
    for line in f:
        italian_words.add(line.strip())

output = []
for word in voynich_words:
    if word in italian_words:
        output.append(word)

print(output)
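For instance, with a small in-memory word list standing in for italian-words.txt (the words below are only placeholders, not real decoder output):

```python
italian_words = {"ciao", "mondo", "parola"}       # stand-in for the file's contents
candidate_words = ["ciao", "qokeedy", "parola"]   # hypothetical decoded anagrams
output = [w for w in candidate_words if w in italian_words]
print(output)  # -> ['ciao', 'parola']
```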
So I'm trying to count the characters in anhCrawler, with and without spaces, along with the position of "DEATH STAR", and return it all in theReport. I can't get the numbers to count correctly. Please help!
anhCrawler = """Episode IV, A NEW HOPE. It is a period of civil war. \
Rebel spaceships, striking from a hidden base, have won their first \
victory against the evil Galactic Empire. During the battle, Rebel \
spies managed to steal secret plans to the Empire's ultimate weapon, \
the DEATH STAR, an armored space station with enough power to destroy \
an entire planet. Pursued by the Empire's sinister agents, Princess Leia\
races home aboard her starship, custodian of the stolen plans that can \
save her people and restore freedom to the galaxy."""
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR occurs and starts at position {3}.
"""
def analyzeCrawler(thetext):
    numchars = 0
    nospacechars = 0
    numspacechars = 0
    anhCrawler = thetext
    word = anhCrawler.split()
    for char in word:
        numchars = word[numchars]
        if numchars == " ":
            numspacechars += 1
    anhCrawler = re.split(" ", anhCrawler)
    for char in anhCrawler:
        nospacechars += 1
    numwords = len(anhCrawler)
    pos = thetext.find("DEATH STAR")
    char_len = len("DEATH STAR")
    ds = thetext[261:271]
    dspos = "[261:271]"
    return theReport.format(numchars, nospacechars, numwords, dspos)
print analyzeCrawler(theReport)
You're overthinking this problem.
Number of chars in string (returns 520):
len(anhCrawler)
Number of non-whitespace characters in the string (split removes the whitespace, and join rebuilds the string without it) (returns 434):
len(''.join(anhCrawler.split()))
Finding the position of "DEATH STAR" (returns 261):
anhCrawler.find("DEATH STAR")
Here is a simplified version of your function:
import re

def analyzeCrawler2(thetext, text_to_search="DEATH STAR"):
    numchars = len(thetext)
    nospacechars = len(re.sub(r"\s+", "", thetext))
    numwords = len(thetext.split())
    dspos = thetext.find(text_to_search)
    return theReport.format(numchars, nospacechars, numwords, dspos)

print analyzeCrawler2(anhCrawler)
This text contains 520 characters (434 if you ignore spaces).
There are approximately 87 words in the text. The phrase
DEATH STAR occurs and starts at position 261.
I think the tricky part is removing the whitespace from the string to calculate the non-space character count. This can be done simply with a regular expression; the rest should be self-explanatory.
First off, you need to indent the code that's inside a function. Second... your code can be simplified to the following:
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR is the {3}th word and starts at the {4}th character.
"""
def analyzeCrawler(thetext):
    numchars = len(thetext)
    nospacechars = len(thetext.replace(' ', ''))
    numwords = len(thetext.split())
    word = 'DEATH STAR'
    # 'DEATH STAR' spans two tokens, so look up its first word
    wordPosition = thetext.split().index(word.split()[0])
    charPosition = thetext.find(word)
    return theReport.format(
        numchars, nospacechars, numwords, wordPosition, charPosition
    )
I modified the last two format arguments because it wasn't clear what you meant by dspos, although maybe it's obvious and I'm not seeing it. In any case, I included the word and char position instead. You can determine which one you really meant to include.