I'm currently working on neural text-to-speech, and preprocessing the data takes several steps. One step is to convert the numerals in a string into English words. The closest thing I could find is num2words, but I'm not sure how to apply it to an existing string. Here's my use case:
I have a list of strings like this:
list_string = ['I spent 140 dollar yesterday','I have 3 brothers and 2 sisters']
I want to convert it into:
output_string = ['I spent one hundred forty dollar yesterday','I have three brothers and two sisters']
The difficulty is that one text may contain several numbers, and even if I can extract them using re.match, I'm not sure how to put the converted numbers back into the string.
No need to worry about floating-point numbers or years for now, since my strings don't contain them.
Thanks
There is a quick way to do it in one line, using a regex to match runs of digits and replace them in the string:
from num2words import num2words
import re
list_string = ['I spent 140 dollar yesterday','I have 3 brothers and 2 sisters']
output_string = [re.sub(r'\d+', lambda m: num2words(int(m.group())), sentence) for sentence in list_string]
Otherwise, you can iterate through the words in each sentence and replace those that are numbers. Please see the code below:
from num2words import num2words
list_string = ['I spent 140 dollar yesterday','I have 3 brothers and 2 sisters']
output_string = []
for sentence in list_string:
    output_sentence = []
    for word in sentence.split():
        if word.isdigit():
            # Whole token is digits: spell it out
            output_sentence.append(num2words(int(word)))
        else:
            output_sentence.append(word)
    output_string.append(' '.join(output_sentence))
print(output_string)
# Output
# ['I spent one hundred and forty dollar yesterday', 'I have three brothers and two sisters']
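One difference between the two approaches is worth seeing on a small example: the split-based loop only converts tokens that consist entirely of digits, so a token such as "140," is left alone, while the regex converts any run of digits. The sketch below is a minimal illustration under stated assumptions: `spell` and `WORDS` are hypothetical stand-ins for num2words (they only know the numbers used here) so that the snippet runs without the package installed.

```python
import re

# Hypothetical stand-in for num2words so the sketch runs without the package;
# it only knows the numbers used below.
WORDS = {"3": "three", "140": "one hundred and forty"}

def spell(digits):
    return WORDS[digits]

def replace_by_split(sentence):
    # Token-based approach: only tokens made entirely of digits are converted.
    return " ".join(spell(w) if w.isdigit() else w for w in sentence.split())

def replace_by_regex(sentence):
    # Regex approach: every run of digits is converted, even next to punctuation.
    return re.sub(r"\d+", lambda m: spell(m.group()), sentence)

sentence = "I paid 140, then 3 more"
print(replace_by_split(sentence))  # "140," is not converted
print(replace_by_regex(sentence))  # both numbers are converted
```

If your sentences contain punctuation next to numbers, the regex version is probably the safer choice.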
I tried doing it in bash; the script looks like this.
Save it as conver2words.sh and invoke it from your Python script.
digits=(
    "" one two three four five six seven eight nine
    ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen
)
tens=("" "" twenty thirty forty fifty sixty seventy eighty ninety)
units=("" thousand million billion trillion)
number2words() {
    local -i number=$((10#$1))
    local -i u=0
    local words=()
    local group
    # Process the number three digits at a time, lowest group first
    while ((number > 0)); do
        group=$(hundreds2words $((number % 1000)) )
        [[ -n "$group" ]] && group="$group ${units[u]}"
        words=("$group" "${words[@]}")
        ((u++))
        ((number = number / 1000))
    done
    echo "${words[*]}"
}
hundreds2words() {
    local -i num=$((10#$1))
    if ((num < 20)); then
        echo "${digits[num]}"
    elif ((num < 100)); then
        echo "${tens[num / 10]} ${digits[num % 10]}"
    else
        echo "${digits[num / 100]} hundred $("$FUNCNAME" $((num % 100)) )"
    fi
}
with_commas() {
    # sed -r ':a;s/^([0-9]+)([0-9]{3})/\1,\2/;ta' <<<"$1"
    # or, with just bash
    while [[ $1 =~ ^([0-9]+)([0-9]{3})(.*) ]]; do
        set -- "${BASH_REMATCH[1]},${BASH_REMATCH[2]}${BASH_REMATCH[3]}"
    done
    echo "$1"
}
for arg; do
    [[ $arg == *[^0-9]* ]] && result="NaN" || result=$(number2words "$arg")
    printf "%s\t%s\n" "$(with_commas "$arg")" "$result"
done
In action:
$ bash ./conver2words.sh 9 98 987
9 nine
98 ninety eight
987 nine hundred eighty seven
You can check whether a token is a number and call this script to get its words.
Here is the Python code for this. It is a draft, but it should give you the idea:
import subprocess

list_string = ['I spent 140 dollar yesterday','I have 3 brothers and 2 sisters']
for sentence in list_string:
    for word in sentence.split(" "):
        if word.isnumeric():
            # The token is numeric: call the bash script and read its output.
            # (os.system can also run the script, but it does not capture the output.)
            out = subprocess.run(['bash', 'conver2words.sh', word], capture_output=True, text=True)
            line = out.stdout.strip()
            print(line)
        else:
            print(word)
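If spawning a shell for every number is too slow or awkward, the same group-by-thousands logic can be written directly in Python. The following is a rough port of the idea, not a replacement for num2words: the function names (`hundreds_to_words`, `number_to_words`, `spell_out_numbers`) are my own, it produces "one hundred forty" without "and", and it only covers non-negative integers below 10^15.

```python
import re

DIGITS = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
          "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
          "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
UNITS = ["", "thousand", "million", "billion", "trillion"]

def hundreds_to_words(num):
    """Spell out 0..999; returns '' for 0."""
    if num < 20:
        return DIGITS[num]
    if num < 100:
        return " ".join(filter(None, [TENS[num // 10], DIGITS[num % 10]]))
    rest = hundreds_to_words(num % 100)
    return " ".join(filter(None, [DIGITS[num // 100], "hundred", rest]))

def number_to_words(number):
    """Spell out a non-negative integer below 10**15."""
    if number == 0:
        return "zero"
    groups = []
    unit = 0
    while number > 0:
        number, chunk = divmod(number, 1000)  # peel off the lowest three digits
        words = hundreds_to_words(chunk)
        if words:
            groups.insert(0, " ".join(filter(None, [words, UNITS[unit]])))
        unit += 1
    return " ".join(groups)

def spell_out_numbers(sentence):
    # Replace every run of digits in place, leaving the rest of the text untouched.
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), sentence)

list_string = ['I spent 140 dollar yesterday', 'I have 3 brothers and 2 sisters']
print([spell_out_numbers(s) for s in list_string])
```

Because the replacement happens inside `re.sub`, there is no need to split the sentence and stitch it back together.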
I have various instances of strings such as:
- hello world,i am 2000to -> hello world, i am 2000 to
- the state was 56,869,12th -> the state was 56,869, 12th
- covering.2% -> covering. 2%
- fiji,295,000 -> fiji, 295,000
To deal with the first case, I came up with a two-step regex:
re.sub(r"(?<=[,])(?=[^\s])(?=[^0-9])", r" ", text) # hello world, i am 20,000to
re.sub(r"(?<=[0-9])(?=[.^[a-z])", r" ", text) # hello world, i am 20,000 to
But this breaks the text in other ways, and the remaining cases are not covered either. Can anyone suggest a more general regex that handles all of these cases properly? I've tried using replace, but it makes some unintended replacements, which in turn cause other problems. I'm not an expert in regex and would appreciate pointers.
This approach covers your cases above by breaking the text into tokens:
import re

in_list = [
    'hello world,i am 2000to',
    'the state was 56,869,12th',
    'covering.2%',
    'fiji,295,000',
    'and another example with a decimal 12.3not4,5 is right out',
    'parrot,, is100.00% dead',
    'Holy grail runs for this portion of 100 minutes,!, 91%. Fascinating'
]
tokenizer = re.compile(r'[a-zA-Z]+[\.,]?|(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?:%|st|nd|rd|th)?[\.,]?')
for s in in_list:
    # Keep only the recognized tokens and rejoin them with single spaces
    print(' '.join(re.findall(pattern=tokenizer, string=s)))
# hello world, i am 2000 to
# the state was 56,869, 12th
# covering. 2%
# fiji, 295,000
# and another example with a decimal 12.3 not 4, 5 is right out
# parrot, is 100.00% dead
# Holy grail runs for this portion of 100 minutes, 91%. Fascinating
Breaking up the regex, each token is the longest available substring that is either:
- only letters, with or without a trailing period or comma: [a-zA-Z]+[\.,]?
- OR (|) a number-ish expression, which is made of:
  - either 1 to 3 digits \d{1,3} followed by one or more groups of comma + 3 digits (?:,\d{3})+, OR (|) any number of comma-free digits \d+
  - optionally a decimal point followed by at least one digit: (?:\.\d+)?
  - optionally a suffix (percent, 'st', 'nd', 'rd', 'th'): (?:%|st|nd|rd|th)?
  - optionally a period or comma: [\.,]?
Note that (?:blah) is used to suppress re.findall's natural desire to tell you how every parenthesized group matches up on an individual basis. In this case we just want it to walk forward through the string and return each whole token, and the ?: accomplishes this.
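To make the difference concrete, here is a small sketch comparing a capturing and a non-capturing version of the thousands-group part of the pattern. The sample text and variable names are my own.

```python
import re

text = "fiji,295,000 and 12.5%"

# With capturing groups, findall returns a tuple of the groups for each match,
# and a repeated group only keeps its last repetition.
capturing = re.findall(r"(\d{1,3})(,\d{3})+", text)

# With non-capturing groups, findall returns the whole matched substring.
non_capturing = re.findall(r"\d{1,3}(?:,\d{3})+", text)

print(capturing)      # [('295', ',000')]
print(non_capturing)  # ['295,000']
```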
I have been working on a program lately, and I want to add functionality that takes user speech such as "Show me my schedule for the next five (or 5) days", extracts the number ("five" or 5) as an integer, and uses it in a different part of the code to request data from Google Calendar. The Google part is mostly done, but how do I extract numbers such as "five", that is, numbers written as words? I found the code below while looking around, but it only returns True or False, and I'm not sure how to make it return the actual number. Your help would be greatly appreciated!
import nltk
text = "Is there a one two three in there?"
def existence_of_numeric_data(text):
    text = nltk.word_tokenize(text)
    pos = nltk.pos_tag(text)
    for word, pos_tag in pos:
        # 'CD' is the part-of-speech tag for cardinal numbers
        if pos_tag == 'CD':
            return True
    return False
print(existence_of_numeric_data(text))
Is there a way to make this return the numbers as integers? For example,
if the string says "Show my schedule for the next five days",
it should return the number 5 as a separate int.
If your text is like "Contains 1 2 3", then you can simply do the following:
for word in text.split():
    if word.isdigit():
        num = int(word)
It should work.
But for text like "Contains one two three", you can make a dictionary that maps the words to numbers:
dt = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
and then simply search for every word of this dictionary in the given text:
for words in dt:
    for w in text.split():
        if w == words:
            num = dt[words]
But this works only if you have a limited number of words. For example, if the text contains twenty and your dictionary does not have twenty, then it will not work.
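Putting the two cases together, a minimal sketch might look like the following. The name `extract_numbers` and the contents of `NUMBER_WORDS` are my own assumptions; the dictionary only covers zero to ten, so anything beyond that (for example "twenty") is silently skipped, exactly as described above.

```python
import re

# Assumed vocabulary; extend it as needed.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

def extract_numbers(text):
    """Return every number in the text as an int, whether written as digits or as a known word."""
    numbers = []
    # Split into runs of letters or runs of digits, ignoring punctuation.
    for token in re.findall(r"[a-z]+|\d+", text.lower()):
        if token.isdigit():
            numbers.append(int(token))
        elif token in NUMBER_WORDS:
            numbers.append(NUMBER_WORDS[token])
    return numbers

print(extract_numbers("Show my schedule for the next five days"))  # [5]
print(extract_numbers("Is there a one two three in there?"))       # [1, 2, 3]
```

Returning a list rather than a single value keeps all the numbers when a sentence contains more than one.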
I found a package called word2number (install it with pip) that does the job just fine. This is how you use it:
from word2number import w2n
text = "There are five days in a week"
print(w2n.word_to_num(text))
Output:
5
I have a string:
s = "Now is the time for all good men to come to the aid of their country. Time is of the essence friends."
I want to divide s into substrings 25 characters in length like this:
"Now is the time for all\n",
"good men to come to the\n",
"aid of their country.\n",
"Time is of the essence,\n",
"friends."
or
Now is the time for all
good men to come to the
aid of their country.
Time is of the essence,
friends.
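using spaces to pad each line 'equally', starting on the left, so that each substring is 25 characters long.
Using split() I can divide the string s into groups of words, each at most 25 characters long:
d = []
line = []
n = 0  # total length of the words on the current line
for word in s.split():
    # len(line) accounts for the single spaces between the words
    if n + len(word) + len(line) <= 25:
        line.append(word)
        n += len(word)
    else:
        d.append(' '.join(line))
        line = [word]
        n = len(word)
if line:
    d.append(' '.join(line))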
result:
d=['Now is the time for all', 'good men to come to the', 'aid of their country.', 'Time is of the essence', 'friends.']
Then I use this list d to build the string t.
I don't understand how to pad between the words in each group to reach a length of 25. It involves some circular method, but I haven't found that method yet.
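One possible way to do the padding is sketched below, under the assumption that each group already fits within the width. It computes how many spaces the line needs in total, gives every gap the same base amount, and hands the leftover spaces to the leftmost gaps one at a time. The function name `justify` is my own.

```python
def justify(words, width=25):
    """Pad the gaps between words with spaces so the line is `width` long.

    Extra spaces that cannot be shared evenly go to the leftmost gaps.
    Assumes the words fit within `width` when separated by single spaces.
    """
    gaps = len(words) - 1
    if gaps == 0:
        return words[0]  # a single word has no gap to pad
    total_spaces = width - sum(len(w) for w in words)
    base, extra = divmod(total_spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        # The first `extra` gaps get one more space than the others.
        parts.append(word + " " * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return "".join(parts)

d = ['Now is the time for all', 'good men to come to the',
     'aid of their country.', 'Time is of the essence', 'friends.']
t = "\n".join(justify(line.split()) for line in d)
print(t)
```

In typeset text the last line is usually left unjustified; here "friends." is a single word, so it is returned unchanged.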
I want to extract from a sentence a sequence composed of: 1 company name, 0 or more numbers (spelled out in words), and 0, 1, or 2 letters from the radio alphabet (alpha, bravo, charlie...).
There can be at most 5 numbers and at most 2 letters.
Numbers and letters always form separate runs: they are never mixed (it is impossible to have 'FIVE ALPHA ZERO').
No other words can appear inside a run of numbers or a run of letters.
So we have 1 company name, possibly followed by 1 run of numbers, and then possibly 1 run of letters.
There can be multiple occurrences in one sentence.
For that I have to use groups that contain all the radio letters separated by a logical or |, and the same for numbers.
company.txt contains the names of companies :
AIGLE-AZUR
AIR-ALGERIE
AIR-ARABIA
sentence.txt contains 1 sentence, ex : AIR-NOSTRUM EIGHT SEVEN SIX FOUR INBOUND OVDIL HUH REACHING ONE FIVE ZERO
I tried with egrep in bash:
company=$(cat company.txt | tr '\n' '|')
number="ZERO |ONE |TWO |TREE |THREE |FOUR |FIVE |SIX |SEVEN |EIGHT |NINER |NINE |TEN "
letter="ALPHA |BRAVO |CHARLIE |DELTA |ECHO |FOXTROT |GOLF |HOTEL |INDIA |JULIET |KILO |LIMA |MIKE |NOVEMBER |OSCAR |PAPA |QUEBEC |ROMEO |SIERRA |TANGO |UNIFORM |VICTOR |WHISKEY |XRAY |YANKEE |ZULU "
egrep "($company) ($number)*($letter)*" --only-matching sentence.txt
Example sentence : AIR-NOSTRUM EIGHT SEVEN SIX FOUR INBOUND OVDIL HUH REACHING ONE FIVE ZERO
The output is : AIR-NOSTRUM EIGHT SEVEN SIX FOUR
ONE FIVE ZERO
The first result is the expected one, but why do I also get "ONE FIVE ZERO"?
It should find only the first, because I want to extract a sequence with 1 company, 0 or more numbers, and 0 or more letters.
I also tried in python3 with the re module, starting with only the numbers:
re.findall("(ONE |FIVE |ZERO )*",'HELLO ZERO ONE FIVE ZERO ALPHA BRAVO TURN LEFT FIVE ZERO')
output : ['', '', '', '', '', '', 'ZERO ', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
I want as output the sequence ['ZERO ONE FIVE ZERO'] and the sequence ['FIVE ZERO'] (but not ['ZERO ONE FIVE ZERO FIVE ZERO']).
Is it possible to do what I am trying with the re module?
Here I tried with only numbers, but the goal is to add the company category and the letter category.
Can someone explain what I did wrong in these cases?
The output with Python's re isn't at all what I expected, and with egrep I get a match that should not appear. I am very confused about that.
Thank you
It's the * that messes up your regex in python:
>>> import re
>>> s="HELLO ZERO ONE FIVE ZERO ALPHA BRAVO TURN LEFT"
>>> f=re.findall("(ONE |FIVE |ZERO )", s)
>>> f
['ZERO ', 'ONE ', 'FIVE ', 'ZERO ']
>>> t=''.join(f)
>>> t
'ZERO ONE FIVE ZERO '
Or in bash:
$ echo "HELLO ZERO ONE FIVE ZERO ALPHA BRAVO TURN LEFT" | grep -Eo '(ONE |FIVE |ZERO )' | tr -d '\n'
ZERO ONE FIVE ZERO
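Some background on why the * causes trouble: a pattern like (ONE |FIVE |ZERO )* can match the empty string, so re.findall reports an empty match at almost every position, and for a repeated capturing group it only returns the last repetition. If you want each run of numbers as one string, one option is a non-capturing group that requires at least one number. This is a minimal sketch with only three number words; the names `NUMBER` and `pattern` are mine.

```python
import re

# Non-capturing alternation, so findall returns whole matches instead of groups.
NUMBER = r"(?:ZERO|ONE|FIVE)"

# One number, followed by zero or more " number" repetitions.
# Requiring the first number means the pattern can never match an empty string.
pattern = re.compile(rf"\b{NUMBER}(?: {NUMBER})*\b")

text = "HELLO ZERO ONE FIVE ZERO ALPHA BRAVO TURN LEFT FIVE ZERO"
sequences = pattern.findall(text)
print(sequences)  # ['ZERO ONE FIVE ZERO', 'FIVE ZERO']
```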
EDIT:
In that case you can make use of "Limiting Repetition", where the syntax is {min,max}.
>>> import re
>>> a = ["AIR-NOSTRUM EIGHT SEVEN SIX FOUR INBOUND OVDIL HUH REACHING ONE FIVE ZERO",
...      "AIR-NOSTRUM EIGHT SEVEN SIX FOUR ALPHA INBOUND OVDIL HUH REACHING ONE FIVE ZERO",
...      "AIR-NOSTRUM EIGHT SEVEN SIX FOUR NINE ALPHA MIKE INBOUND OVDIL HUH REACHING ONE FIVE ZERO",
...      "AIR-NOSTRUM EIGHT SIX NINE ALPHA MIKE INBOUND OVDIL HUH REACHING ONE FIVE ZERO",
...      "AIR-NOSTRUM MIKE INBOUND OVDIL HUH REACHING ONE FIVE ",
...      "EIGHT SEVEN SIX MIKE INBOUND OVDIL HUH REACHING ONE FIVE ZERO"]
>>> company="AIR-NOSTRUM|WHATEVER"
>>> number="ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN"
>>> letter="ALPHA|BRAVO|CHARLIE|DELTA|ECHO|FOXTROT|GOLF|HOTEL|INDIA|JULIET|KILO|LIMA|MIKE|NOVEMBER|OSCAR|PAPA|QUEBEC|ROMEO|SIERRA|TANGO|UNIFORM|VICTOR|WHISKEY|XRAY|YANKEE|ZULU"
>>> r="(("+company+"){0,1}[\t ]*((("+number+") ){0,5})[\t ]*(("+letter+") ){0,2})"
>>> f = []
>>> for i in a:
...     t=re.findall(r, i)
...     if len(t) > 0:
...         if len(t[0]) > 0:
...             f.append(t[0][0])
...
>>> f
['AIR-NOSTRUM EIGHT SEVEN SIX FOUR ', 'AIR-NOSTRUM EIGHT SEVEN SIX FOUR ALPHA ', 'AIR-NOSTRUM EIGHT SEVEN SIX FOUR NINE ALPHA MIKE ', 'AIR-NOSTRUM EIGHT SIX NINE ALPHA MIKE ', 'AIR-NOSTRUM MIKE ', 'EIGHT SEVEN SIX MIKE ']
You should check out regex101. This helped me a lot to learn Regex.
EDIT:
See example above. The trick is to make a group that repeats 0 to 1 times: (company a|company b){0,1}.
The problem is that in Python I can't properly add the letters and the company name:
>>> re.findall("(ONE |FIVE |ZERO )(ALPHA |BRAVO )",'HELLO ZERO ONE FIVE ZERO ALPHA BRAVO TURN LEFT ONE ')
[('ZERO ', 'ALPHA ')]
>>> re.findall("(ONE |FIVE |ZERO )*(ALPHA |BRAVO )",'HELLO ZERO ONE FIVE ZERO ALPHA BRAVO TURN LEFT ONE ')
[('ZERO ', 'ALPHA '), ('', 'BRAVO ')]
I want something like ['ZERO ONE FIVE ZERO ALPHA BRAVO'] or ['ZERO', 'ONE', 'FIVE', 'ZERO', 'ALPHA', 'BRAVO'], not these 2 outputs.
For the example sentence: AIR-NOSTRUM EIGHT SEVEN SIX FOUR INBOUND OVDIL HUH REACHING ONE FIVE ZERO
I want the output to be: AIR-NOSTRUM EIGHT SEVEN SIX FOUR.
I need to use the * in the regex because I can have 0 or many numbers, and the same for letters.
With egrep I get 2 matches, but I only want the first: AIR-NOSTRUM EIGHT SEVEN SIX FOUR
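For the full requirement (a mandatory company, then up to 5 numbers, then up to 2 letters), one possible sketch is shown below. It is not the pattern from the answer above: here the company is required rather than optional, so a bare "ONE FIVE ZERO" is never reported, and every group is non-capturing so that findall returns whole sequences. The helper `alternation`, the name `callsign`, and the hard-coded company list are assumptions for illustration; in practice the names would come from company.txt.

```python
import re

# Assumed sample data; in practice the names would be read from company.txt.
companies = ["AIGLE-AZUR", "AIR-ALGERIE", "AIR-ARABIA", "AIR-NOSTRUM"]
numbers = "ZERO ONE TWO TREE THREE FOUR FIVE SIX SEVEN EIGHT NINER NINE TEN".split()
letters = ("ALPHA BRAVO CHARLIE DELTA ECHO FOXTROT GOLF HOTEL INDIA JULIET KILO "
           "LIMA MIKE NOVEMBER OSCAR PAPA QUEBEC ROMEO SIERRA TANGO UNIFORM "
           "VICTOR WHISKEY XRAY YANKEE ZULU").split()

def alternation(words):
    # Build a non-capturing group such as (?:ALPHA|BRAVO|...), escaping each word.
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

callsign = re.compile(
    rf"\b{alternation(companies)}"          # exactly one company name
    rf"(?: {alternation(numbers)}){{0,5}}"  # then 0 to 5 numbers
    rf"(?: {alternation(letters)}){{0,2}}\b"  # then 0 to 2 letters
)

sentence = "AIR-NOSTRUM EIGHT SEVEN SIX FOUR INBOUND OVDIL HUH REACHING ONE FIVE ZERO"
print(callsign.findall(sentence))  # ['AIR-NOSTRUM EIGHT SEVEN SIX FOUR']
```

Because the pattern has no capturing groups, `findall` also handles several occurrences in one sentence and returns one string per occurrence.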
EDIT: The "already answered" question does not cover what I am asking. My string already comes out in word format. I need to extract those number words from my string into a list.
I'm trying to work with phrase manipulation for a voice assistant in Python.
When I speak something like:
"What is 15,276 divided by 5?"
It comes out formatted like this:
"what is fifteen thousand two hundred seventy six divided by five"
I already have a way to change a string to an int for the math part, so is there a way to somehow get a list like this from the phrase?
['fifteen thousand two hundred seventy six','five']
Go through the list of words and check each one for membership in a set of number words. Add adjacent number words to a temporary list. When you find a non-number word, save the current group (if it is not empty) and start a new one. Remember to account for non-number words at the beginning of the sentence and number words at the end of the sentence.
result = []
group = []
nums = set('one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty fifty sixty seventy eighty ninety hundred thousand million billion trillion quadrillion quintillion'.split())
for word in "what is fifteen thousand two hundred seventy six divided by five".split():
    if word in nums:
        group.append(word)
    else:
        # A non-number word ends the current group, if there is one
        if group:
            result.append(group)
        group = []
# Flush the last group when the sentence ends with a number word
if group:
    result.append(group)
Result:
>>> result
[['fifteen', 'thousand', 'two', 'hundred', 'seventy', 'six'], ['five']]
To join each sublist into a single string:
>>> list(map(' '.join, result))
['fifteen thousand two hundred seventy six', 'five']
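The same grouping can also be expressed with itertools.groupby, which splits the word list into consecutive runs of number words and non-number words. This is just an alternative sketch of the technique above; the function name `number_phrases` is my own, and the word set is assumed to be complete enough for your input.

```python
from itertools import groupby

NUMBER_WORDS = set(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty "
    "thirty forty fifty sixty seventy eighty ninety hundred thousand million "
    "billion trillion".split()
)

def number_phrases(sentence):
    """Return each run of consecutive number words as one string."""
    words = sentence.lower().split()
    # groupby yields (key, run) pairs; the key is True for runs of number words.
    return [
        " ".join(run)
        for is_number, run in groupby(words, key=NUMBER_WORDS.__contains__)
        if is_number
    ]

phrase = "what is fifteen thousand two hundred seventy six divided by five"
print(number_phrases(phrase))
# ['fifteen thousand two hundred seventy six', 'five']
```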