I'm building my own home assistant and trying to make a multi-intent classification system. However, I cannot find a way to split a user's query into its multiple different intents.
For example:
I have my data for one of my intents (the format is the same for all):
{"intent_name": "music.off", "examples": ["turn off the music", "kill the music", "cut the music"]}
and the query said by the user would be:
'dim the lights, cut the music and play Black Mirror on tv'
I want to split the sentence into its individual intents, such as:
['dim the lights', 'cut the music', 'play black mirror on tv']
However, I can't just use re.split on the sentence with "and" and "," as delimiters, because if the user asks:
'turn the lights off in the living room, dining room, kitchen and bedroom'
it will be split into
['turn the lights off in the living room', 'dining room', 'kitchen', 'bedroom']
which would not be usable with my intent detection.
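(For reference, the kind of naive split I mean - the delimiter pattern here is chosen just for illustration:)
import re

query = "turn the lights off in the living room, dining room, kitchen and bedroom"
print(re.split(r",\s*|\s+and\s+", query))
# ['turn the lights off in the living room', 'dining room', 'kitchen', 'bedroom']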
This is my problem; thank you in advance.
UPDATE
Okay, so I've got this far with my code. It can get the examples from my data and identify the different intents inside, as I wished; however, it is not splitting the original query into its individual intent parts, it is just matching.
import os
import json
from fuzzywuzzy import process

text = "dim the lights, shut down the music and play White Collar"
choices = []
commands = []

def get_matches():
    # collect the example lists from every intent file in ./data
    for root, dirs, files in os.walk("./data"):
        for filename in files:
            with open(f"./data/{filename}", "r") as f:
                data = json.loads(f.read())
            choices.append(data["examples"])
    # fuzzy-match the whole query against each intent's examples
    for set_ in choices:
        command = process.extract(text, set_, limit=1)
        commands.append(command)
    print(f"all commands : {commands}")

get_matches()
This returns [('dim the lights'), ('turn off the music'), ('play Black Mirror')], which are the correct intents, but I have no way of knowing which part of the query relates to each intent - this is the main problem.
My data is as follows, very simple for now until I figure out a method:
play.json
{"intent_name": "play.device" , "examples" : ["play Black Mirror" , "play Netflix on tv" , "can you please stream Stranger Things"]}
music.json
{"intent_name": "music.off" , "examples": ["turn off the music" , "cut the music" , "kill the music"]}
lights.json
{"intent_name": "lights.dim" , "examples" : ["dim the lights" , "turn down the lights" , "lower the brightness"]}
It seems that you are mixing two problems in your question:
Multiple independent intents within a single query (e.g. shut down the music and play White Collar)
Multiple slots (using the form-filling framework) within a single intent (e.g. turn the lights off in the living room bedroom and kitchen).
These problems are quite different. Both, however, can be formulated as a word tagging problem (similar to POS tagging) and solved with machine learning (e.g. a CRF or a bi-LSTM over pretrained word embeddings, predicting a label for each word).
The intent labels for each word can be created using BIO notation, e.g.
shut B-music_off
down I-music_off
the I-music_off
music I-music_off
and O
play B-tv_on
White I-tv_on
Collar I-tv_on
turn B-light_off
the I-light_off
lights I-light_off
off I-light_off
in I-light_off
the I-light_off
living I-light_off
room I-light_off
bedroom I-light_off
and I-light_off
kitchen I-light_off
The model would read the sentence and predict the labels. It should be trained on at least hundreds of examples - you have to generate or mine them.
After splitting intents with a model trained on such labels, you will have short texts, each corresponding to a unique intent. Then for each short text you need to run a second segmentation, looking for slots. E.g. the sentence about the lights can be represented as
turn B-action
the I-action
lights I-action
off I-action
in O
the B-place
living I-place
room I-place
bedroom B-place
and O
kitchen B-place
Now the BIO markup helps a lot: the B-place tag separates bedroom from the living room.
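For illustration, here is a minimal sketch (plain Python, using the tags from the example above) of how predicted BIO tags can be grouped back into labelled spans:
def bio_to_spans(tokens, tags):
    """Group (token, BIO tag) pairs into labelled text spans."""
    spans, words, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):              # a new span starts here
            if words:
                spans.append((label, " ".join(words)))
            words, label = [token], tag[2:]
        elif tag.startswith("I-") and words:  # continue the current span
            words.append(token)
        else:                                 # "O" closes any open span
            if words:
                spans.append((label, " ".join(words)))
            words, label = [], None
    if words:
        spans.append((label, " ".join(words)))
    return spans

tokens = "turn the lights off in the living room bedroom and kitchen".split()
tags = ["B-action", "I-action", "I-action", "I-action", "O",
        "B-place", "I-place", "I-place", "B-place", "O", "B-place"]
print(bio_to_spans(tokens, tags))
# [('action', 'turn the lights off'), ('place', 'the living room'),
#  ('place', 'bedroom'), ('place', 'kitchen')]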
Both segmentations can in principle be performed by one hierarchical end-to-end model (google semantic parsing if you want it), but I feel that two simpler taggers can work as well.
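As a concrete starting point, below is a minimal sketch of training such a word tagger with sklearn-crfsuite (my choice here, not something prescribed by the question; any CRF or bi-LSTM implementation would do). The feature set and the two training sentences are toy placeholders - you would need hundreds of labelled sentences:
import sklearn_crfsuite  # pip install sklearn-crfsuite

def word_features(sent, i):
    # Deliberately tiny feature set; real systems add more context, word shapes, embeddings, etc.
    return {
        "word.lower": sent[i].lower(),
        "is_first": i == 0,
        "prev_word": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

def featurize(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Toy training data in BIO notation (the intent labels are placeholders).
train_sents = [
    ("shut down the music and play White Collar".split(),
     ["B-music_off", "I-music_off", "I-music_off", "I-music_off",
      "O", "B-tv_on", "I-tv_on", "I-tv_on"]),
    ("dim the lights and cut the music".split(),
     ["B-lights_dim", "I-lights_dim", "I-lights_dim",
      "O", "B-music_off", "I-music_off", "I-music_off"]),
]

X_train = [featurize(tokens) for tokens, _ in train_sents]
y_train = [labels for _, labels in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)

query = "dim the lights and play Black Mirror on tv".split()
print(crf.predict([featurize(query)])[0])
The predicted tag sequence can then be passed through a helper like bio_to_spans above to get one text chunk per intent, and the slot tagger can be trained the same way on the B-action/B-place labels.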
Related
I have spent a few days exploring the excellent FARM library and its modular approach to building models. The default output (result) however is very verbose, including a multiplicity of texts, values and ASCII artwork. For my research I only require the predicted labels from my NLP text classification model, together with the individual probabilities. How do I do that? I have been experimenting with nested lists/dictionaries but am unable to neatly produce a simple list of output labels and probabilities.
# Test your model on a sample (Inference)
from farm.infer import Inferencer
from pprint import PrettyPrinter
infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)
basic_texts = [
# a snippet or two from Dickens
{"text": "Mr Dombey had remained in his own apartment since the death of his wife, absorbed in visions of the youth, education, and destination of his baby son. Something lay at the bottom of his cool heart, colder and heavier than its ordinary load; but it was more a sense of the child’s loss than his own, awakening within him an almost angry sorrow."},
{"text": "Soon after seven o'clock we went down to dinner, carefully, by Mrs. Jellyby's advice, for the stair-carpets, besides being very deficient in stair-wires, were so torn as to be absolute traps."},
{"text": "Walter passed out at the door, and was about to close it after him, when, hearing the voices of the brothers again, and also the mention of his own name, he stood irresolutely, with his hand upon the lock, and the door ajar, uncertain whether to return or go away."},
# from Lewis Carroll
{"text": "I have kept one for many years, and have found it of the greatest possible service, in many ways: it secures my _answering_ Letters, however long they have to wait; it enables me to refer, for my own guidance, to the details of previous correspondence, though the actual Letters may have been destroyed long ago;"},
{"text": "The Queen gasped, and sat down: the rapid journey through the air had quite taken away her breath and for a minute or two she could do nothing but hug the little Lily in silence."},
{"text": "Rub as she could, she could make nothing more of it: she was in a little dark shop, leaning with her elbows on the counter, and opposite to her was an old Sheep, sitting in an arm-chair knitting, and every now and then leaving off to look at her through a great pair of spectacles."},
# G K Chesterton
{"text": "Basil and I walked rapidly to the window which looked out on the garden. It was a small and somewhat smug suburban garden; the flower beds a little too neat and like the pattern of a coloured carpet; but on this shining and opulent summer day even they had the exuberance of something natural, I had almost said tropical. "},
{"text": "This is the whole danger of our time. There is a difference between the oppression which has been too common in the past and the oppression which seems only too probable in the future."},
{"text": "But whatever else the worst doctrine of depravity may have been, it was a product of spiritual conviction; it had nothing to do with remote physical origins. Men thought mankind wicked because they felt wicked themselves. "},
]
result = infer_model.inference_from_dicts(dicts=basic_texts)
PrettyPrinter().pprint(result)
#print(result)
All logging (incl. the ASCII artwork) is done in FARM via Python's logging framework. You can simply disable the logs up to a certain level like this at the beginning of your script:
import logging
logging.disable(logging.ERROR)
Is that what you are looking for, or would you rather adjust the output format of the model predictions? If you only need the label and probability, you could do something like this:
...
basic_texts = [
{"text": "Stackoverflow is a great community"},
{"text": "It's snowing"},
]
infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)
result = infer_model.inference_from_dicts(dicts=basic_texts)
minimal_results = []
for sample in result:
    # Only extract the top 1 prediction per sample
    top_pred = sample["predictions"][0]
    minimal_results.append({"label": top_pred["label"], "probability": top_pred["probability"]})
PrettyPrinter().pprint(minimal_results)
infer_model.close_multiprocessing_pool()
(I left out the initial model loading etc. - see this example for more details)
I have a dataframe "app_final" where one column, "text_content", contains text in multiple languages. I would like to retain only the text that is in English in that column. Any idea how I should go about that?
I tried using the following Python code to create a new column, "english_text", by running each word in each text through langdetect and adding only the English words to the new column. However, I got an error: "LangDetectException: No features in text."
How else should I approach this issue?
for i in range(0, len(app_final['text_content'])):
    for x in range(0, len(app_final['text_content'][i].split())):
        english = []
        language = detect(app_final['text_content'][i].split()[x])
        eng_text = np.where(language == 'en', app_final['text_content'][i].split()[x], np.NaN)
        english.append(eng_text)
app_final['english_text'] = english
this is one example of the record that I'm trying to extract only the english text:
print(app_final['text_content'][635])
LINEのプッシュメッセージのセグメント配信が可能です。 フィルターを使って、LINE公式アカウントでメッセージ配信可能なセグメント以外の独自セグメントへのメッセージ配信可能になります。メッセージ配信先を絞ることで、LINE公式アカウントのコストの節約も可能。 LINEで自由度の高いリッチメニューが作成できます。 LINE公式アカウント上に自由度の高いリッチメニューの作成が可能になります。LINEのデフォルトでは対応していない9分割・12分割などおすすめしたい商品・ウェブページへのリンクだけ大きく表示など変則的なデザインに対応。 LINEトーク上でコレクションの内商品の一括表示が可能 LINEトーク上に設定したコレクション情報の表示が可能になります。セール・おすすめなどの独自コレクションをユーザに一括でレコメンド可能です。 LINE公式アカウント経由の購買率アップ。ユーザーの属性にあわせた特別なリッチメニュー表示・プッシュメッセージ配信が可能 KisukeはLINEを新たな販売チャネルとして活用できるECマネジメントサービスです。LINE公式アカウントの友達をセグメント化してメッセージ配信が可能になります。また、自由度の高いリッチメニューの配信も可能になります。LINEでは配信できない区分けのリッチメニューの配信が可能です。
所有しているLINE公式アカウントを上手くマーケティングに活用できていないEC事業社様に最適な選択肢です。
Kisukeの主な機能
1.プッシュ通知(LINEメッセージ配信)
Shopifyとの連携により、例えば「特定の商品を買い替えそうなタイミングの方」「注文途中でサイトから離脱したカゴ落ちユーザ」といった様々なセグメントのユーザに対してマッチしたメッセージを一斉配信することが可能になります。
2.リッチメニュー配信
画像の配置パターンやリンクエリアのカスタマイズ機能があるKisukeを使えば、様々な画像配置を試すことができ、ボタンの設置等も可能となります。LINE公式アカウントでは対応していないリッチメニューのパターンも配信可能です。
例えばこんな使い方も…
1.カゴ落ちユーザに期間限定割引クーポンを送信…メールで送るより短時間でメッセージが認識されるため、1時間限定クーポンも有効です。
2.Shopifyのフィルターと連携して、1か月前に消耗品を買ったユーザにリピート促進メッセージを送信して、リピート購入を進める。
など細分化したユーザの需要に応じてメッセージ配信が可能になります。
ご質問、ご要望等お待ちしております。
使い方、カスタマイズのご依頼など、お気軽にお問い合わせください。
Kisuke is an EC management service that can use LINE as a new sales channel. LINE official account friends can be segmented to deliver messages. In addition, a rich menu with a high degree of freedom can be distributed. Rich menus that cannot be distributed with LINE can be distributed.
This is the best choice for EC companies who have not used their LINE official accounts for marketing.
Main functions of Kisuke
Push notification (LINE message delivery)
By collaborating with Shopify, it is possible to broadcast matched messages to users in various segments such as “when it is time to buy a specific product” or “the user who dropped out of the site while ordering” It becomes.
Rich menu delivery
With Kisuke, which has an image layout pattern and link area customization function, you can try various image layouts and set buttons. Rich menu patterns not supported by LINE official accounts can also be distributed.
Since your text is split into paragraphs, you can try detecting whether a paragraph is English or not using Polyglot: https://polyglot.readthedocs.io/en/latest/Installation.html
Since your Japanese text has English words within it, you should make use of the most probable language of the paragraph. For example:
from polyglot.detect import Detector
text = u"""
2.リッチメニュー配信 画像の配置パターンやリンクエリアのカスタマイズ機能があるKisukeを使えば、様々な画像配置を試すことができ、ボタンの設置等も可能となります。LINE公式アカウントでは対応していないリッチメニューのパターンも配信可能です。 例えばこんな使い方も… 1.カゴ落ちユーザに期間限定割引クーポンを送信…メールで送るより短時間でメッセージが認識されるため、1時間限定クーポンも有効です。 2.Shopifyのフィルターと連携して、1か月前に消耗品を買ったユーザにリピート促進メッセージを送信して、リピート購入を進める。 など細分化したユーザの需要に応じてメッセージ配信が可能になります。
"""
if Detector(text).languages[0].name == 'Japanese':
    pass  # do nothing
elif Detector(text).languages[0].name == 'English':
    english_paragraphs.append(text)  # append to your collection of English text
Repeat the process for each paragraph, then replace that cell with a new cell that keeps only the English paragraphs.
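For illustration, a minimal sketch of applying that idea to the dataframe column. Splitting paragraphs on newlines and the quiet=True flag (which, as far as I know, stops polyglot raising on very short or unreliable snippets) are assumptions on my part:
from polyglot.detect import Detector

def keep_english_paragraphs(cell):
    """Keep only the paragraphs whose most probable language is English."""
    english = []
    for paragraph in str(cell).split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        try:
            detector = Detector(paragraph, quiet=True)
        except Exception:
            continue  # skip paragraphs polyglot cannot handle at all
        if detector.languages[0].name == 'English':
            english.append(paragraph)
    return "\n".join(english)

app_final['english_text'] = app_final['text_content'].apply(keep_english_paragraphs)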
I am running spaCy on a paragraph of text and it's not extracting the quoted text the same way in each case, and I don't understand why that is.
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("""A seasoned TV exec, Greenblatt spent eight years as chairman of NBC Entertainment before WarnerMedia. He helped revive the broadcast network's primetime lineup with shows like "The Voice," "This Is Us," and "The Good Place," and pushed the channel to the top of the broadcast-rating ranks with 18-49-year-olds, Variety reported. He also drove Showtime's move into original programming, with series like "Dexter," "Weeds," and "Californication." And he was a key programming exec at Fox Broadcasting in the 1990s.""")
Here's the whole output:
A
seasoned
TV
exec
,
Greenblatt
spent
eight years
as
chairman
of
NBC Entertainment
before
WarnerMedia
.
He
helped
revive
the
broadcast
network
's
primetime
lineup
with
shows
like
"
The Voice
,
"
"
This
Is
Us
,
"
and
"The Good Place
,
"
and
pushed
the
channel
to
the
top
of
the
broadcast
-
rating
ranks
with
18-49-year-olds
,
Variety
reported
.
He
also
drove
Showtime
's
move
into
original
programming
,
with
series
like
"
Dexter
,
"
"
Weeds
,
"
and
"
Californication
.
"
And
he
was
a
key
programming
exec
at
Fox Broadcasting
in
the 1990s
.
The one that bothers me the most is The Good Place, which is extracted as "The Good Place (with the opening quotation mark attached to the token). Since the quotation mark is part of the token, I can't extract quoted text with a token Matcher later on… Any idea what's going on here?
The issue isn't the tokenization (which should always split " off in this case), but the NER, which uses a statistical model and doesn't always make 100% perfect predictions.
I don't think you've shown all your code here, but from the output, I would assume you've merged entities by adding merge_entities to the pipeline. These are the resulting tokens after entities are merged, and if an entity wasn't predicted correctly, you'll get slightly incorrect tokens.
I tried the most recent en_core_web_lg and couldn't replicate these NER results, but the models for each version of spacy have slightly different results. If you haven't, try v2.2, which uses some data augmentation techniques to improve the handling of quotes.
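To see whether it is the NER (and not the tokenizer) that is pulling the quote into a token, you can inspect the predicted entities directly; a minimal sketch (the sample sentence is shortened from yours, and merge_entities is deliberately left out here):
import spacy

nlp = spacy.load("en_core_web_lg")
print(nlp.pipe_names)  # check whether "merge_entities" is actually in your pipeline

doc = nlp('He revived the lineup with shows like "The Voice," "This Is Us," and "The Good Place," Variety reported.')

# Raw entity predictions: if an entity span swallowed the opening quote,
# the merged token would include it too.
for ent in doc.ents:
    print(repr(ent.text), ent.label_)

# The tokenizer on its own always splits the quote off:
print([t.text for t in nlp.tokenizer('"The Good Place,"')])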
I recently discovered the GNSDK (Gracenote SDK) that seems to provide examples in several programming languages to recognize music samples by fingerprinting them, and then to request their audio database to get the corresponding artist and song title.
But the documentation is horrible.
How can I, using Python and the GNSDK, perform recognition of an audio sample file? There aren't any examples or tutorials in the provided docs.
Edit: I really want to use the GNSDK with Python. Don't post anything unrelated, you'll waste your time.
I ended up using ACRCloud which works very well.
Python example:
from acrcloud.recognizer import ACRCloudRecognizer
config = {
    'host': 'eu-west-1.api.acrcloud.com',
    'access_key': 'access key',
    'access_secret': 'secret key',
    'debug': True,
    'timeout': 10
}
acrcloud = ACRCloudRecognizer(config)
print(acrcloud.recognize_by_file('sample of a track.wav', 0))
https://github.com/acrcloud/acrcloud_sdk_python
Keywords are: Beat Spectrum Analysis and Rhythm Detection.
This is a well-known Python library that may contain a solution to your question:
https://github.com/aubio/aubio
I also recommend checking this page for other libraries:
https://wiki.python.org/moin/PythonInMusic
Lastly, this project is a more Python-friendly solution and an easy way to start:
https://github.com/librosa/librosa
An example from librosa that calculates the tempo (beats per minute) of a song:
# Beat tracking example
from __future__ import print_function
import librosa
# 1. Get the file path to the included audio example
filename = librosa.util.example_audio_file()
# 2. Load the audio as a waveform `y`
# Store the sampling rate as `sr`
y, sr = librosa.load(filename)
# 3. Run the default beat tracker
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
print('Estimated tempo: {:.2f} beats per minute'.format(tempo))
# 4. Convert the frame indices of beat events into timestamps
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print('Saving output to beat_times.csv')
librosa.output.times_csv('beat_times.csv', beat_times)
But I have to mention that this is still a very immature field in computer science, and new papers on it come out all the time. So it will be useful for you to also follow scholars for recent discoveries.
ADDITION:
Web API Wrappers mentioned in Gracenote's official docs:
https://developer.gracenote.com/web-api#python
For Python:
https://github.com/cweichen/pygn
But as you can see, this wrapper is not well documented and is immature. Because of that, I suggest you use this Ruby wrapper instead of the Python one:
For Ruby:
https://github.com/JDiPierro/tmsapi
require 'tmsapi'

# Create an instance of the API
tms = TMSAPI::API.new :api_key => 'API_KEY_HERE'

# Get all movie showtimes for Austin, Texas
movie_showings = tms.movies.theatres.showings({ :zip => "78701" })

# Print out the movie name, theatre name, and date/time of the showing.
movie_showings.each do |movie|
  movie.showtimes.each do |showing|
    puts "#{movie.title} is playing at '#{showing.theatre.name}' at #{showing.date_time}."
  end
end
# 12 Years a Slave is playing at 'Violet Crown Cinema' at 2013-12-23T12:45.
# A Christmas Story is playing at 'Alamo Drafthouse at the Ritz' at 2013-12-23T16:00.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T11:00.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T13:40.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T16:20.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T19:00.
# American Hustle is playing at 'Violet Crown Cinema' at 2013-12-23T21:40.
If you are not comfortable with Ruby or Ruby on Rails then the only option is developing your own Python wrapper.
Going just by your headline question, and since there are no examples or tutorials for the GNSDK, try looking at other options,
for one:
dejavu
Audio fingerprinting and recognition algorithm implemented in Python,
see the explanation here:
Dejavu can memorize audio by listening to it once and fingerprinting
it. Then by playing a song and recording microphone input, Dejavu
attempts to match the audio against the fingerprints held in the
database, returning the song being played.
https://github.com/worldveil/dejavu
seems about right.
I am new to Python and am wondering if anyone can help me with some file loading.
The situation is that I have some text files and I'm trying to do sentiment analysis. Here's the text file format. It is split into three categories: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to turn it into this:
<category> <user> <review>
I have 50k lines of this data.
I have tried loading it directly into numpy, but it gives an empty-separator error. I looked on Stack Overflow, but I couldn't find a situation that applies to a varying number of delimiters. For instance, I will never know how many spaces there are in the data set that I have.
My biggest problem is: how do you count the number of delimiters and assign them to columns? Is there a way that I can split it into the three categories <department>, <user>, <review>? Bear in mind that the review data can contain random commas and spaces which I can't control, so the system must be smart enough to pick that up!
Any ideas? Is there a way that I can tell Python that after it reads the user data, everything after that falls under the review?
With data like this I'd just use split() with the maxsplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from StringIO import StringIO
s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
    category, user, review = line.split(None, 2)
    print("category: {} - user: {} - review: '{}'".format(category,
                                                          user,
                                                          review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split
What about doing it sorta manually:
data = []
for line in input_data:
    tmp_split = line.split(" ")
    # Get the first part (dept)
    dept = tmp_split[0]
    # Get the 2nd part (user)
    user = tmp_split[1]
    # Everything after is the review - put spaces in between each piece
    review = " ".join(tmp_split[2:])
    data.append([dept, user, review])