Translate paragraph in python - python

I am trying to translate a Paragraph from english to my local language which I have written the code as:
def translate(inputvalue):
//inputvalue is an array of english paragraphs
try:
translatedData = []
trans = Translator()
for i in inputvalue:
sentence = re.sub(r'(?<=[.,])(?=[^\s])', r' ', i)
//adding space where there is no space after , or ,
t = trans.translate(sentence, src='en', dest = 'ur')
//translating from english to my local language urdu
translatedData.append(t.text)
//appending data in translatedData array
DisplayOutput.output(translatedData)
//finally calling DisplayOutput function to print translated data
The problem I am facing here is that my local language begins writing from [Right side]
and googletrans is not giving proper output. It puts periods ,commas, untranslated words at the beginning or at the end for example:
I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD.
it would translate this sentence as:
میری عمر 6 سال ہے،. مجھے کارٹون جانور اور پودے کھینچنا پسند ہےمجھے ADHD 6نہیں ہے.
As you can observe it could not translate ADHD as it is just an abbreviation it puts that at the beginning of the sentence and same goes for periods and numbers and commas.
How should I translate it so that it does not conflict like that.
If putting the sentence in another array like:
['I am', '6', 'years old', '.', 'I love to draw cartoons',',', 'animals',',', 'and plants','.', 'I do not have', 'ADHD','.']
I have no idea how to achieve this type of array but I believe it can solve the problem.
As I can translate only the parts that has English words and then appending the list in a string.
Kindly Help me generate this type of array or any other solution

string = "I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD."
arr = []
substring = ""
alpha = None
for char in string:
if char.isalpha() or char == " ": alpha = True
else: alpha = False
if substring.replace(" ","").isalpha():
if alpha:
substring += char
else:
arr.append(substring)
substring = char
else:
if alpha:
arr.append(substring)
substring = char
while " " in arr: arr.remove(" ")
while "" in arr: arr.remove("")
print(arr)
Loop through each character in the string, then check if it is a letter or not a letter with ".isalpha()". Then depending on the conditions of the current substring, you append to it or create a new one.

Related

I'm trying to solve for the other Acronyms

I have the following strings that I need to make acronyms for:
Institute of Electrical and Electronics Engineers
As Soon As Possible
University of California San Diego
Self Contained Underwater Breathing Apparatus
This is my code
my_string = input()
my_string2 = my_string.upper()
for x in range(0, 1, len(my_string2)):
print(my_string2[0::15])
but it only worked for the first input. There are three more examples that this code doesn't cover. What I need is for this code to be modified in such a way where it will create an Acronym out of any input.The first Acronym is called "Institute of Electrical and Electronics Engineers" and once it's placed into the input it returns IEEE as the output. Basically all of the first letters that are capitalized are kept and no lower cased words remain.
I'm new to programming so I bet the way I did it is a bit funky, but this worked for me on my zybooks lab:
Name = input()
AStart = Name.split()
AFinal = ''
for string in AStart:
if string[0].isupper():
AFinal += string[0] + '.'
print(AFinal)
Here's a regex based solution which looks for words that start with a capital letter and extracts their starting letter, then joins all them together to make the acronym:
import re
strings = [
'As Soon As Possible',
'Institute of Electrical and Electronics Engineers',
'University of California San Diego',
'Self Contained Underwater Breathing Apparatus'
]
for s in strings:
acronym = ''.join(re.findall(r'\b[A-Z]', s))
print(acronym)
If you don't want to use regex, you can just split the strings and test the first character of each word to see if it is uppercase:
for s in strings:
acronym = ''.join(w[0] for w in s.split(' ') if w[0].isupper())
print(acronym)
In either case the output is:
ASAP
IEEE
UCSD
SCUBA
To run from input, use this code:
import re
s = input()
acronym = ''.join(re.findall(r'\b[A-Z]', s))
print(acronym)
Or:
s = input()
acronym = ''.join(w[0] for w in s.split(' ') if w[0].isupper())
print(acronym)
Demo on ideone.com
try this:
full_string = input("Enter Text: ")
string_list = full_string.split()
acronym = ""
for string in string_list:
acronym += f"{string[0].upper()}"
print(acronym)
output:
Enter Text: This is a long string please be kind
TIALSPBK
This is what I used.
phrase = str(input()).rstrip() #gets the phrase and makes string sanitized
for char in phrase: #goes through every char
x = char #Did this to make it easier to keep track
if x.isupper() == True: #The char loop check if the value is true or not
print(x, end='') #print the true uppercase, end print on 1 line.

Capitalize the first word of a sentence in a text

I want to make sure that each sentence in a text starts with a capital letter.
E.g. "we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken." should become
"We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken."
I tried using split() to split the sentence. Then, I capitalized the first character of each line. I appended the rest of the string to the capitalized character.
text = input("Enter the text: \n")
lines = text.split('. ') #Split the sentences
for line in lines:
a = line[0].capitalize() # capitalize the first word of sentence
for i in range(1, len(line)):
a = a + line[i]
print(a)
I want to obtain "We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken."
I get "We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister
The good news is they tasted like chicken."
This code should work:
text = input("Enter the text: \n")
lines = text.split('. ') # Split the sentences
for index, line in enumerate(lines):
lines[index] = line[0].upper() + line[1:]
print(". ".join(lines))
The error in your code is that str.split(chars) removes the splitting delimiter char and that's why the period is removed.
Sorry for not providing a thorough description as I cannot think of what to say. Please feel free to ask in comments.
EDIT: Let me try to explain what I did.
Lines 1-2: Accepts the input and splits into a list by '. '. On the sample input, this gives: ['"We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister', 'the good news is they tasted like chicken.']. Note the period is gone from the first sentence where it was split.
Line 4: enumerate is a generator and iterates through an iterator returning the index and item of each item in the iterator in a tuple.
Line 5: Replaces the place of line in lines with the capital of the first character plus the rest of the line.
Line 6: Prints the message. ". ".join(lines) basically reverses what you did with split. str.join(l) takes a iterator of strings, l, and sticks them together with str between all the items. Without this, you would be missing your periods.
When you split the string by ". " that removes the ". "s from your string and puts the rest of it into a list. You need to add the lost periods to your sentences to make this work.
Also, this can result in the last sentence to have double periods, since it only has "." at the end of it, not ". ". We need to remove the period (if it exists) at the beginning to make sure we don't get double periods.
text = input("Enter the text: \n")
output = ""
if (text[-1] == '.'):
# remove the last period to avoid double periods in the last sentence
text = text[:-1]
lines = text.split('. ') #Split the sentences
for line in lines:
a = line[0].capitalize() # capitalize the first word of sentence
for i in range(1, len(line)):
a = a + line[i]
a = a + '.' # add the removed period
output = output + a
print (output)
We can also make this solution cleaner:
text = input("Enter the text: \n")
output = ""
if (text[-1] == '.'):
# remove the last period to avoid double periods in the last sentence
text = text[:-1]
lines = text.split('. ') #Split the sentences
for line in lines:
a = line[0].capitalize() + line [1:] + '.'
output = output + a
print (output)
By using str[1:] you can get a copy of your string with the first character removed. And using str[:-1] will give you a copy of your string with the last character removed.
split splits the string AND none of the new strings contain the delimiter - or the string/character you split by.
change your code to this:
text = input("Enter the text: \n")
lines = text.split('. ') #Split the sentences
final_text = ". ".join([line[0].upper()+line[1:] for line in lines])
print(final_text)
The below can handle multiple sentence types (ending in ".", "!", "?", etc...) and will capitalize the first word of each of the sentences. Since you want to keep your existing capital letters, using the capitalize function will not work (since it will make none sentence starting words lowercase). You can throw a lambda function into the list comp to take advantage of upper() on the first letter of each sentence, this keeps the rest of the sentence completely un-changed.
import re
original_sentence = 'we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken.'
val = re.split('([.!?] *)', original_sentence)
new_sentence = ''.join([(lambda x: x[0].upper() + x[1:])(each) if len(each) > 1 else each for each in val])
print(new_sentence)
The "new_sentence" list comprehension is the same as saying:
sentence = []
for each in val:
sentence.append((lambda x: x[0].upper() + x[1:])(each) if len(each) > 1 else each)
print(''.join(sentence))
You can use the re.sub function in order to replace all characters following the pattern . \w with its uppercase equivalent.
import re
original_sentence = 'we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken.'
def replacer(match_obj):
return match_obj.group(0).upper()
# Replace the very first characer or any other following a dot and a space by its upper case version.
re.sub(r"(?<=\. )(\w)|^\w", replacer, original_sentence)
>>> 'We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken.'

Replace a word in a String by indexing without "string replace function" -python

Is there a way to replace a word within a string without using a "string replace function," e.g., string.replace(string,word,replacement).
[out] = forecast('This snowy weather is so cold.','cold','awesome')
out => 'This snowy weather is so awesome.
Here the word cold is replaced with awesome.
This is from my MATLAB homework which I am trying to do in python. When doing this in MATLAB we were not allowed to us strrep().
In MATLAB, I can use strfind to find the index and work from there. However, I noticed that there is a big difference between lists and strings. Strings are immutable in python and will likely have to import some module to change it to a different data type so I can work with it like how I want to without using a string replace function.
just for fun :)
st = 'This snowy weather is so cold .'.split()
given_word = 'awesome'
for i, word in enumerate(st):
if word == 'cold':
st.pop(i)
st[i - 1] = given_word
break # break if we found first word
print(' '.join(st))
Here's another answer that might be closer to the solution you described using MATLAB:
st = 'This snow weather is so cold.'
given_word = 'awesome'
word_to_replace = 'cold'
n = len(word_to_replace)
index_of_word_to_replace = st.find(word_to_replace)
print st[:index_of_word_to_replace]+given_word+st[index_of_word_to_replace+n:]
You can convert your string into a list object, find the index of the word you want to replace and then replace the word.
sentence = "This snowy weather is so cold"
# Split the sentence into a list of the words
words = sentence.split(" ")
# Get the index of the word you want to replace
word_to_replace_index = words.index("cold")
# Replace the target word with the new word based on the index
words[word_to_replace_index] = "awesome"
# Generate a new sentence
new_sentence = ' '.join(words)
Using Regex and a list comprehension.
import re
def strReplace(sentence, toReplace, toReplaceWith):
return " ".join([re.sub(toReplace, toReplaceWith, i) if re.search(toReplace, i) else i for i in sentence.split()])
print(strReplace('This snowy weather is so cold.', 'cold', 'awesome'))
Output:
This snowy weather is so awesome.

Python code flow does not work as expected?

I am trying to process various texts by regex and NLTK of python -which is at http://www.nltk.org/book-. I am trying to create a random text generator and I am having a slight problem. Firstly, here is my code flow:
Enter a sentence as input -this is called trigger string, is assigned to a variable-
Get longest word in trigger string
Search all Project Gutenberg database for sentences that contain this word -regardless of uppercase lowercase-
Return the longest sentence that has the word I spoke about in step 3
Append the sentence in Step 1 and Step4 together
Assign the sentence in Step 4 as the new 'trigger' sentence and repeat the process. Note that I have to get the longest word in second sentence and continue like that and so on-
So far, I have been able to do this only once. When I try to keep this to continue, the program only keeps printing the first sentence my search yields. It should actually look for the longest word in this new sentence and keep applying my code flow described above.
Below is my code along with a sample input/output :
Sample input
"Thane of code"
Sample output
"Thane of code Norway himselfe , with terrible numbers , Assisted by that most disloyall Traytor , The Thane of Cawdor , began a dismall Conflict , Till that Bellona ' s Bridegroome , lapt in proofe , Confronted him with selfe - comparisons , Point against Point , rebellious Arme ' gainst Arme , Curbing his lauish spirit : and to conclude , The Victorie fell on vs"
Now this should actually take the sentence that starts with 'Norway himselfe....' and look for the longest word in it and do the steps above and so on but it doesn't. Any suggestions? Thanks.
import nltk
from nltk.corpus import gutenberg
triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
split_str = triggerSentence.split()#split the sentence into words
longestLength = 0
longestString = ""
montyPython = 1
while montyPython:
#code to find the longest word in the trigger sentence input
for piece in split_str:
if len(piece) > longestLength:
longestString = piece
longestLength = len(piece)
listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
listOfWords = gutenberg.words()# all words in gutenberg books -list format-
# I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way.
longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
#get longest sentence -list format with every word of sentence being an actual element-
longestSent=[longestSentence]
for word in longestSent:#convert the list longestSentence to an actual string
sstr = " ".join(word)
print triggerSentence + " "+ sstr
triggerSentence = sstr
How about this?
You find longest word in trigger
You find longest word in the longest sentence containing word found in 1.
The word of 1. is the longest word of the sentence of 2.
What happens? Hint: answer starts with "Infinite". To correct the problem you could find set of words in lower case to be useful.
BTW when you think MontyPython becomes False and the program finish?
Rather than searching the entire corpus each time, it may be faster to construct a single map from word to the longest sentence containing that word. Here's my (untested) attempt to do this.
import collections
from nltk.corpus import gutenberg
def words_in(sentence):
"""Generate all words in the sentence (lower-cased)"""
for word in sentence.split():
word = word.strip('.,"\'-:;')
if word:
yield word.lower()
def make_sentence_map(books):
"""Construct a map from words to the longest sentence containing the word."""
result = collections.defaultdict(str)
for book in books:
for sentence in book:
for word in words_in(sentence):
if len(sentence) > len(result[word]):
result[word] = sent
return result
def generate_random_text(sentence, sentence_map):
while True:
yield sentence
longest_word = max(words_in(sentence), key=len)
sentence = sentence_map[longest_word]
sentence_map = make_sentence_map(gutenberg.sents())
for sentence in generate_random_text('Thane of code.', sentence_map):
print sentence
Mr. Hankin's answer is more elegant, but the following is more in keeping with the approach you began with:
import sys
import string
import nltk
from nltk.corpus import gutenberg
def longest_element(p):
"""return the first element of p which has the greatest len()"""
max_len = 0
elem = None
for e in p:
if len(e) > max_len:
elem = e
max_len = len(e)
return elem
def downcase(p):
"""returns a list of words in p shifted to lower case"""
return map(string.lower, p)
def unique_words():
"""it turns out unique_words was never referenced so this is here
for pedagogy"""
# there are 2.6 million words in the gutenburg corpus but only ~42k unique
# ignoring case, let's pare that down a bit
for word in gutenberg.words():
words.add(word.lower())
print 'gutenberg.words() has', len(words), 'unique caseless words'
return words
print 'loading gutenburg corpus...'
sentences = []
for sentence in gutenberg.sents():
sentences.append(downcase(sentence))
trigger = sys.argv[1:]
target = longest_element(trigger).lower()
last_target = None
while target != last_target:
matched_sentences = []
for sentence in sentences:
if target in sentence:
matched_sentences.append(sentence)
print '===', target, 'matched', len(matched_sentences), 'sentences'
longestSentence = longest_element(matched_sentences)
print ' '.join(longestSentence)
trigger = longestSentence
last_target = target
target = longest_element(trigger).lower()
Given your sample sentence though, it reaches fixation in two cycles:
$ python nltkgut.py Thane of code
loading gutenburg corpus...
=== target thane matched 24 sentences
norway himselfe , with terrible
numbers , assisted by that most
disloyall traytor , the thane of
cawdor , began a dismall conflict ,
till that bellona ' s bridegroome ,
lapt in proofe , confronted him with
selfe - comparisons , point against
point , rebellious arme ' gainst arme
, curbing his lauish spirit : and to
conclude , the victorie fell on vs
=== target bridegroome matched 1 sentences
norway himselfe , with
terrible numbers , assisted by that
most disloyall traytor , the thane of
cawdor , began a dismall conflict ,
till that bellona ' s bridegroome ,
lapt in proofe , confronted him with
selfe - comparisons , point against
point , rebellious arme ' gainst arme
, curbing his lauish spirit : and to
conclude , the victorie fell on vs
Part of the trouble with the response to the last problem is that it did what you asked, but you asked a more specific question than you wanted an answer to. Thus the response got bogged down in some rather complicated list expressions that I'm not sure you understood. I suggest that you make more liberal use of print statements and don't import code if you don't know what it does. While unwrapping the list expressions I found (as noted) that you never used the corpus wordlist. Functions are a help also.
You are assigning "split_str" outside of the loop, so it gets the original value and then keeps it. You need to assign it at the beginning of the while loop, so it changes each time.
import nltk
from nltk.corpus import gutenberg
triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
longestLength = 0
longestString = ""
montyPython = 1
while montyPython:
#so this is run every time through the loop
split_str = triggerSentence.split()#split the sentence into words
#code to find the longest word in the trigger sentence input
for piece in split_str:
if len(piece) > longestLength:
longestString = piece
longestLength = len(piece)
listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
listOfWords = gutenberg.words()# all words in gutenberg books -list format-
# I tip my hat to Mr.Alex Martelli for this part, which helps me find the longest sentence
lt = longestString.lower() #this line tells you whether word list has the longest word in a case-insensitive way.
longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key = len)
#get longest sentence -list format with every word of sentence being an actual element-
longestSent=[longestSentence]
for word in longestSent:#convert the list longestSentence to an actual string
sstr = " ".join(word)
print triggerSentence + " "+ sstr
triggerSentence = sstr

Problems with nested loops…

I’m going to explain to you in details of what I want to achieve.
I have 2 programs about dictionaries.
The code for program 1 is here:
import re
words = {'i':'jeg','am':'er','happy':'glad'}
text = "I am happy.".split()
translation = []
for word in text:
word_mod = re.sub('[^a-z0-9]', '', word.lower())
punctuation = word[-1] if word[-1].lower() != word_mod[-1] else ''
if word_mod in words:
translation.append(words[word_mod] + punctuation)
else:
translation.append(word)
translation = ' '.join(translation).split('. ')
print('. '.join(s.capitalize() for s in translation))
This program has following advantages:
You can write more than one sentence
You get the first letter capitalized after “.”
The program “append” the untranslated word to the output (“translation = []”)
Here is the code for program 2:
words = {('i',): 'jeg', ('read',): 'leste', ('the', 'book'): 'boka'}
max_group = len(max(words))
text = "I read the book".lower().split()
translation = []
position = 0
while text:
for m in range(max_group - 1, -1, -1):
word_mod = tuple(text[:position + m])
if word_mod in words:
translation.append(words[word_mod])
text = text[position + m:]
position += 1
translation = ' '.join(translation).split('. ')
print('. '.join(s.capitalize() for s in translation))
With this code you can translate idiomatic expressions or
“the book” to “boka”.
Here is how the program proceeds the codes.
This is the output:
1
('i',)
['jeg']
['read', 'the', 'book']
0
()
1
('read', 'the')
0
('read',)
['jeg', 'leste']
['the', 'book']
1
('the', 'book')
['jeg', 'leste', 'boka']
[]
0
()
Jeg leste boka
What I want is to implement some of the codes from program 1 into program 2.
I have tried many times with no success…
Here is my dream…:
If I change the text to the following…:
text = "I read the book. I read the book! I read the book? I read the book.".lower().split()
I want the output to be:
Jeg leste boka. Jeg leste boka! Jeg leste boka? Jeg leste boka.
So please, tweak your brain and help me with a solution…
I appreciate any reply very much!
Thank you very much in advance!
My solution flow would be something like this:
dict = ...
max_group = len(max(dict))
input = ...
textWPunc = input.lower().split()
textOnly = [re.sub('[^a-z0-9]', '', x) for x in input.lower().split()]
translation = []
while textOnly:
for m in [max_group..0]:
if textOnly[:m] in words:
check for punctuation here using textWPunc[:m]
if punctuation present in textOnly[:m]:
Append translated words + punctuation
else:
Append only translated words
textOnly = textOnly[m:]
textWPunc = textWPunc[m:]
join translation to finish
The key part being you keep two parallel lines of text, one that you check for words to translate and the other you check for punctuation if your translation search comes up with a hit. To check for punctuation, I fed the word group that I was examining into re() like so: re.sub('[a-z0-9]', '', wordGroup) which will strip out all characters but no punctuation.
Last thing was that your indexing looks kind of weird to me with that position variable. Since you're truncating the source string as you go, I'm not sure that's really necessary. Just check the leftmost x words as you go instead of using that position variable.

Categories

Resources