Python: split text into sentences when an uppercase word appears

I am using Google Speech-to-Text API and after I transcribe an audio file, I end up with a text which is a conversation between two people and it doesn't contain punctuation (Google's automatic punctuation or speaker diarization features are not supported for this non-English language). For example:
Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course
It appears as one big sentence, but I want to split the different sentences whenever an uppercase word appears, and thus have:
Hi you are speaking with customer support how can i help you
Hi my name is whatever and this is my problem
Can you give me your address please
Yes of course
I am using Python and I don't want to use regex, instead I want to use a simpler method. What should I add to this code in order to split each result into multiple sentences as soon as I see an uppercase letter?
# Each result is for a consecutive portion of the audio. Iterate through
# them to get the transcripts for the entire audio file.
transcribed_text = []
for i, result in enumerate(response.results):
    # The first alternative is the most likely one for this portion.
    alternative = result.alternatives[0]
    print("-" * 20)
    print("First alternative of result {}".format(i))
    print("Transcript: {}".format(alternative.transcript))

A simple solution would be a regex split:
import re

inp = "Hi you are speaking with customer support how can i help you Hi my name is whatever and this is my problem Can you give me your address please Yes of course"
sentences = re.split(r'\s+(?=[A-Z])', inp)
print(sentences)
This prints:
['Hi you are speaking with customer support how can i help you',
'Hi my name is whatever and this is my problem',
'Can you give me your address please',
'Yes of course']
Note that this simple approach can easily fail should there be things like proper names in the middle of sentences, or perhaps acronyms, both of which also contain uppercase letters but are not markers for the actual end of a sentence. A better long-term approach would be to use a library like nltk, which can find sentence boundaries with much higher accuracy.
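Since the question asks for a non-regex alternative, a plain-Python sketch can walk the words and start a new sentence at each capitalized one (the same caveats about proper names and acronyms apply):

```python
def split_on_uppercase(text):
    # Start a new sentence whenever a word begins with an uppercase letter.
    sentences = []
    for word in text.split():
        if word[:1].isupper() or not sentences:
            sentences.append([word])
        else:
            sentences[-1].append(word)
    return [" ".join(s) for s in sentences]

inp = ("Hi you are speaking with customer support how can i help you "
       "Hi my name is whatever and this is my problem "
       "Can you give me your address please Yes of course")
for sentence in split_on_uppercase(inp):
    print(sentence)
```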

Related

Is there a way to detect if unnecessary characters are added to strings to bypass spam detection?

I'm building a simple spam classifier, and from a cursory look at my dataset, most spam messages put spaces between "spammy" words, which I assume is done to bypass the spam classifier. Here are some examples:
c redi t card
mort - gage
I would like to be able to take these and encode them in my dataframe as the correct words:
credit card
mortgage
I'm using Python by the way.
This depends a lot on whether you have a list of all spam words or not.
If you do have a list of spam words and you know that there are always only ADDED spaces (e.g. give me your cred it card in formation) but never MISSING spaces (e.g. give me yourcredit cardinformation), then you could use a simple rule-based approach:
spam_words = {"credit card", "rolex"}
spam_words_no_spaces = {"".join(s.split()) for s in spam_words}

sentence = "give me your credit car d inform ation and a rol ex"
tokens = sentence.split()
# Check every consecutive run of tokens, joined without spaces,
# against the space-stripped spam words.
for length in range(1, len(tokens) + 1):
    for start in range(len(tokens) - length + 1):
        t = tuple(tokens[start:start + length])
        if "".join(t) in spam_words_no_spaces:
            print(t)
Which prints:
> ('rol', 'ex')
> ('credit', 'car', 'd')
So first create a set of all spam words, then, for an easier comparison, remove all their spaces (although you could adjust the method to consider only correctly spaced spam words).
Then split the sentence into tokens and check every consecutive subsequence of the token list (from single words up to the whole sentence, joined without whitespace) against the set of spam words.
If you don't have a list of spam words, your best chance would probably be to do general whitespace correction on the data. Check out Optical Character Recognition (OCR) Post Correction, for which you can find some pretrained models. Also check out this thread, which talks about how to add spaces to spaceless text and even mentions a Python package for that. So in theory you could remove all spaces and then try to split the text again into meaningful words, increasing the chance the spam words are found. Generally your problem (and its opposite, missing whitespace) is called word boundary detection, so you might want to check some resources on that.
Also be aware that modern pretrained models, such as common transformer models, often use sub-token-level embeddings for unknown words, so they can often still combine what they learned from the split and non-split versions of a common word.
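To illustrate the whitespace-correction idea, here is a minimal dictionary-based word-segmentation sketch; the vocabulary and sentence are made up for the example, and real post-correction models are far more robust:

```python
def segment(text, vocab):
    # Dynamic-programming word segmentation: find a split of `text`
    # into words from `vocab` (returns None if no full split exists).
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocab:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

vocab = {"give", "me", "your", "credit", "card", "information"}
print(segment("givemeyourcreditcardinformation", vocab))
# → ['give', 'me', 'your', 'credit', 'card', 'information']
```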

Data anonymization using python

I have an unstructured, free form text (taken from emails, phone conversation transcriptions), a list of first names and a list of last names.
What would be the most effective and Pythonic method to replace all the first names in the text with "--FIRSTNAME--" and all the last names with "--LASTNAME--", based on the lists I have?
I could iterate over the list of first names and do a
text.replace(firstname, '--FIRSTNAME--')
but that seems very inefficient, especially for a very long list of names and many long texts to process. Are there better options?
Example:
Text: "Hello, this is David, how may I help you? Hi, my name is Alex Bender and I am trying to install my new coffee machine."
First name list: ['Abe', 'Alex', 'Andy', 'David', 'Mark', 'Timothy']
Last name list: ['Baxter', 'Bender', 'King', 'McLoud']
Expected output: "Hello, this is --FIRSTNAME--, how may I help you? Hi, my name is --FIRSTNAME-- --LASTNAME-- and I am trying to install my new coffee machine."
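One stdlib option that avoids a replace call per name is to compile each list into a single alternation and scan the text once per list (a sketch: re.escape guards names containing special characters, and \b restricts matches to whole words):

```python
import re

first_names = ['Abe', 'Alex', 'Andy', 'David', 'Mark', 'Timothy']
last_names = ['Baxter', 'Bender', 'King', 'McLoud']

# One compiled alternation per list: the text is scanned once per list,
# not once per name.
first_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, first_names)) + r')\b')
last_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, last_names)) + r')\b')

text = ("Hello, this is David, how may I help you? Hi, my name is "
        "Alex Bender and I am trying to install my new coffee machine.")
text = first_re.sub('--FIRSTNAME--', text)
text = last_re.sub('--LASTNAME--', text)
print(text)
```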
I followed the advice of @furas and checked out the flashtext module. It fully answers my need.
I did run into a problem, as I am working with Hebrew (non-ASCII characters) and the text replacement would not follow word boundaries.
There is a method of the KeywordProcessor class, add_non_word_boundary(self, character), which for some reason is not documented, that allows you to add characters that should not be considered boundary characters (in addition to the default [a-zA-Z0-9_]), allowing whole-word replacement only.

Replacing method for words with boundaries in python (like with regex)

I am looking for a more robust replace method in Python, because I am building a
spellchecker for words in an OCR context.
Let's say we have the following text in python:
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
It is easy to realize that instead of "his text" the right phrase would be "this text".
But if I do text.replace('his', 'this'), then I replace every single 'his' with 'this', so I would get errors like "tthis" (because 'this' itself contains 'his').
When I do a replacement, I would like to replace only the whole word 'his', not the 'his' inside words like 'this'.
Why not try this?
import re

word_to_replace = 'his'
corrected_word = 'this'
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text
Awesome, we did it, but the problem is: what if the word to correct contains a special character like '|'? For example,
'|ights are on' instead of 'lights are on'. Trust me, it happened to me; re.sub is a disaster in that case.
The question is: have you encountered the same problem? Is there any method to solve this? Replacement is the most
robust option I have found.
I tried text.replace(' ' + word_to_replace + ' ', ' ' + corrected_word + ' ') and this solves a lot of cases, but I still
have the problem with phrases like "his is a text", because the replacement doesn't work there: 'his' is at the beginning of a sentence,
so there is no ' his ' to replace with ' this '.
Is there any replacement method in Python that takes the whole word as input, like \b word_to_correct \b
in regexes?
After a few days I solved the problem I had. I hope this can
be helpful for someone else. Let me know if you have any questions.
import re

text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""

# Assume you have already corrected your word via the OCR spellchecker
# and you just have to replace it in the text.
# So we have the following word2correct and corrected_word:
word2correct = 'his'
corrected_word = 'this'

# Now replace the word together with its context.
def context_replace(old_word, new_word, text):
    # Match the word between boundaries (\b) plus up to 10 characters of
    # context on each side; this captures 'his' but not 'this'.
    phrase2correct = re.findall(r'.{1,10}\b' + re.escape(old_word) + r'\b.{1,10}', text)[0]
    # Once the context is matched, substitute the new word.
    phrase_corrected = phrase2correct.replace(old_word, new_word)
    # Now replace the old phrase (phrase2correct) with the corrected one.
    text = text.replace(phrase2correct, phrase_corrected)
    return text
Test if the function works...
print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))
Output:
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with.
It worked for my purpose. I hope this is helpful for someone else.
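An alternative sketch that stays with re.sub but survives metacharacters like '|': escape the word with re.escape and use lookarounds instead of \b, since \b misbehaves when the word starts or ends with a non-word character:

```python
import re

def replace_word(text, old, new):
    # Escape the word so regex metacharacters (like '|') are treated
    # literally, and anchor with lookarounds so it only matches as a
    # standalone token, even when it starts with a non-word character.
    pattern = r'(?<!\w)' + re.escape(old) + r'(?!\w)'
    return re.sub(pattern, new, text)

print(replace_word('the |ights are on', '|ights', 'lights'))
# → the lights are on
```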

How to remove all the spaces between letters?

I have text with words like this: a n a l i z e, c l a s s etc. But there are normal words as well. I need to remove all these spaces between letters of words.
import re

reg_let = re.compile(r'\s[А-Яа-яёЁa-zA-Z](\s)', re.DOTALL)
text = 'T h i s is exactly w h a t I needed'
text = re.sub(reg_let, '', text)
text
OUTPUT:
'Tiis exactlyhtneeded' (while I need - 'This is exactly what I needed')
As far as I know, there is no easy way to do this, because the biggest problem is distinguishing the meaningful words; in other words, you need some semantic engine to tell you which words are meaningful in the sentence.
The only thing I can think of is a word-embedding model; without something like that you can remove as many spaces as you want, but you cannot tell the words apart, meaning you'll never know which spaces not to remove.
I would be glad if someone corrected me if there is a simpler way I am not aware of.
There is no easy solution to this problem.
The only solution I can think of is to use a dictionary to check whether a word is correct (i.e. present in the English dictionary).
But even doing so you'll get a lot of false positives. For example, given the text:
a n a n a s
the words:
a
an
as
are all correct in the English dictionary. How do I split the text? For me, as a human who can read the text, it is clear that the word here is 'ananas'. But one could also split the text as:
an an as
Which is grammatically correct but doesn't make sense in English. The correctness is given by the context, and I, as a human, can understand the context. One could split and concatenate the string in different ways to check whether it makes sense, but unfortunately there is no library or simple procedure that can understand context.
Machine Learning could be a way, but there is no perfect solution.
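As a small illustration of the dictionary approach (and its limits), this sketch collapses a run of single-letter tokens only when the joined letters form a known word; ambiguous single-letter words like 'a' or 'I' will still trip it up, as discussed above:

```python
def unspace(text, vocab):
    # Collapse a run of single-letter tokens into one word when the
    # joined letters form a word in `vocab`; leave other tokens alone.
    tokens = text.split()
    out, run = [], []

    def flush():
        joined = ''.join(run)
        if len(run) > 1 and joined.lower() in vocab:
            out.append(joined)
        else:
            out.extend(run)
        run.clear()

    for tok in tokens:
        if len(tok) == 1 and tok.isalpha():
            run.append(tok)
        else:
            flush()
            out.append(tok)
    flush()
    return ' '.join(out)

print(unspace('use the c l a s s keyword', {'class'}))
# → use the class keyword
```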

Creating a dictionary in Python and using it to translate a word

I have created a Spanish-English dictionary in Python and I have stored it using the variable translation. I want to use that variable in order to translate a text from Spanish into English. This is the code I have used so far:
from nltk.corpus import swadesh
import my_books

es2en = swadesh.entries(['es', 'en'])
translation = dict(es2en)

for sentence in my_books.sents("book_1"):
    for word in my_books.words("book_1"):
        if word in es2en:
            print(translation, end=" ")
        else:
            print("unknown_word", end=" ")
    print("")
My problem is that none of the words in book_1 is actually translated into English, so I get a text full of "unknown_word". I think I'm probably using translation in the wrong way... how could I achieve my desired result?
The .entries() method, when given more than one language, returns not a dictionary but a list of tuples. See here for an example.
You need to convert your list of pairs (2-tuples) into a dictionary. You are doing that with your translation = statement.
However, you then ignore the translation variable, and check for if word in es2en:
You need to check if the word is in translation, and subsequently look up the correct translation, instead of printing the entire dictionary.
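Putting those two fixes together, a corrected loop might look like this (a small stand-in dict and token list replace the swadesh entries and the my_books corpus from the question):

```python
# Stand-ins for dict(swadesh.entries(['es', 'en'])) and my_books.words("book_1")
translation = {'perro': 'dog', 'gato': 'cat'}
words = ['perro', 'y', 'gato']

out = []
for word in words:
    # Look the word up in the dict built from the entries, not in the raw
    # list of tuples, and emit its translation rather than the whole dict.
    out.append(translation.get(word, 'unknown_word'))
print(' '.join(out))
# → dog unknown_word cat
```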
It can also be a case-sensitivity issue.
For example:
if the dict contains the key 'Bomb' and you look up 'bomb',
it won't be found.
Lowercase all the keys in es2en and then look up: word.lower() in es2en
I am in the process of building a translation engine (a language dictionary),
from Bahasa Indonesia to English and vice versa.
I am building it from zero: what I'm doing is collecting all the words in Bahasa, together with their meanings,
then comparing them with a WordNet database (crawled).
After grouping the meanings and pairing the English and Bahasa senses, collect as much data as you can and separate it into scientific content and everyday content.
Tokenize all the data into sentences and compute which words pair with which other words with high probability (in both Bahasa and English). This is needed because every word can have several meanings, and this calculation is used to choose which sense to use.
Example in Bahasa:
'bisa' can mean poison, and then it pairs with high probability with 'snake' or 'bite';
'bisa' can also mean to be able to do something, and then it pairs with verbs or expressions of willingness to do something.
So if the tokenized result pairs with 'snake' or 'bite', you look for the corresponding sense by checking 'snake' and 'poison' in the English database, where you will find that 'venom' (similar in meaning to toxin/poison) always pairs with 'snake'.
Another grouping can be done by word type (nouns, verbs, adjectives, etc.):
bisa == poison (noun)
bisa == can (verb)
That's it. Once you have the statistics, you no longer need the database; you only need the word-matching data.
You can compute the statistics from online data (e.g. Wikipedia), from a downloaded corpus, from a Bible or book file, or from any other source that contains lots of sentences.