I'm building a website in Django that needs to extract keywords from short (Twitter-like) messages.
I've looked at packages like topia.termextract and NLTK, but both seem to be overkill for what I need to do. All I need is to filter out words like "and", "or", "not" while keeping nouns and verbs that aren't conjunctions or other such parts of speech. Are there any "simpler" packages out there that can do this?
EDIT: This needs to be done in near real-time on a production website, so using a keyword extraction service seems out of the question, based on their response times and request throttling.
You can make a set sw of the "stop words" you want to eliminate (maybe copy it once and for all from the stop words corpus of NLTK, depending on how familiar you are with the various natural languages you need to support), then apply it very simply.
E.g., if you have a list of words sent that make up the sentence (shorn of punctuation and lowercased, for simplicity), [word for word in sent if word not in sw] is all you need to make a list of non-stopwords -- could hardly be easier, right?
To get the sent list in the first place, using the re module from the standard library, re.findall(r'\w+', sentstring) might suffice if sentstring is the string with the sentence you're dealing with. It doesn't lowercase, but you can change the list comprehension I suggest above to [word for word in sent if word.lower() not in sw] to compensate for that and (btw) keep the word's original case, which may be useful.
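Putting it together, a minimal sketch (the stop-word set here is just an illustration; the NLTK corpus would give you a fuller one):

import re

sw = {"and", "or", "not", "the", "a", "an", "of", "to"}  # illustrative stop words

def keywords(sentstring):
    # split the raw message into word tokens
    sent = re.findall(r'\w+', sentstring)
    # compare lowercased against the stop-word set, keep the original case
    return [word for word in sent if word.lower() not in sw]

print(keywords("Find the NO and send her to the OR"))
# ['Find', 'NO', 'send', 'her']  -- note that OR got dropped; see the caveat below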
Abbreviations like NO for navigation officer or OR for operations room need a little care lest you cause a SNAFU ;-) One suspects that better results could be obtained from "Find the NO and send her to the OR" by tagging the words with parts of speech using the context ... hint 1: "the OR" should result in "the [noun]" not "the [conjunction]". Hint 2: if in doubt about a word, keep it as a keyword.
I am working on a machine learning chatbot project which uses google's speech recognition api.
Now my problem is, when I say 2 or more sentences in one command, the speech recognition API returns all sentences in one string, without any full stops or commas. As a result, it has become harder to separate the sentences. For example, if I say,
Take a photo. Tell me about today's weather. Open Google Chrome.
the speech recognition API returns:
take a photo tell me about todays weather open Google Chrome
So my chatbot takes this full string as one sentence.
Is there any way to extract sentences from a string like the one above?
(BTW, I am using Python)
If you are going to say multiple commands, join them with a word like "and" and split the input on that word. Then loop through the resulting list and pass each value to your execute function.
If the variable command stores your value, split it using command.split(" and ").
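A minimal sketch, assuming a hypothetical execute() function that handles a single command:

def execute(cmd):
    # hypothetical: dispatch one command to the right handler
    print("executing:", cmd)

command = "take a photo and tell me about todays weather and open Google Chrome"
for part in command.split(" and "):
    execute(part.strip())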
I had previously answered a similar question take a look at it:
https://stackoverflow.com/a/65872940/12279129
I think you could try different approaches to solve the problem:
A Naive solution
I don't know how your system works at the moment, but if you are just looking for certain sub-sentences, you could search the full string for each command you expect.
For example:
input_str = "Take a photo turn on fan".lower()
if "take a photo" in input_str :
print("Just took a photo!")
if "turn on fan" in input_str :
print("Just turned the fan on!")
Of course you could also pick a separator word (like "and", "furthermore", ...) and split on it.
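With several separator words, re.split with an alternation does the job; a sketch (the separator list is just an example):

import re

input_str = "take a photo and turn on fan then open chrome"
# split on any of the chosen separator words
parts = [p.strip() for p in re.split(r'\b(?:and|then|furthermore)\b', input_str)]
print(parts)
# ['take a photo', 'turn on fan', 'open chrome']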
A more advanced solution
You could use an NLP library (e.g. spaCy) and perform part-of-speech tagging and entity recognition so that you can isolate verbs from nouns and so on.
After that you could optionally make use of stemming and lemmatization to generalize the recognition further.
You could also perform intermediate steps with other NLP techniques, like stop-word removal.
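A sketch with spaCy (assumes the en_core_web_sm model has been installed with python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("take a photo tell me about todays weather open Google Chrome")
for token in doc:
    # part-of-speech tag, lemma, and stop-word flag for each token
    print(token.text, token.pos_, token.lemma_, token.is_stop)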
Try auto punctuation from API
Maybe you can try enabling automatic punctuation in the speech-to-text API and see if it works well enough for you.
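With the google-cloud-speech client that would look roughly like this (a sketch only; field names may differ between client versions, so check the current docs):

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    language_code="en-US",
    enable_automatic_punctuation=True,  # ask the API to insert punctuation
)
audio = speech.RecognitionAudio(content=audio_bytes)  # audio_bytes: your recorded audio
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)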
That's because Google Cloud Speech doesn't provide natural language understanding, so you are stuck parsing text transcripts.
You can of course create the natural language understanding component yourself, either by using simple regular expressions or using something like Rasa, but there's a smarter way, too.
Speechly provides you with everything you need to create voice user interfaces on Android, iOS or the web. It returns not only the transcript but also actionable intents and entities, which makes it a lot easier to create something a bit more complex. The best part is that it's free for up to 20 hours a month.
You can see a very simple example of how it works, for instance for creating search experiences, here. The basic idea is always the same: create a model and test that it returns the correct intents for your speech input. After you are done, integrate it into your app by looping through the returned results and, whenever you get the correct intent, reacting in your application as needed. It's actually very simple.
You can use the split method.
Say your string is stored in A:
X = A.split('.')
This makes X a list whose items are the sentences.
I have a python project in mind but i'm not too sure where to start.
I want to do some text comparison between two blocks of text: a user should be able to input two blocks of text, and the program should identify the parts that are different/not the same.
I've seen this functionality in Git - when you make a change in a repo, it shows you the changes before you commit - this makes me think that I should be able to make something with similar functionality.
Any kinda' insight would be greatly appreciated!!
EDIT:
While searching I came across this Git repo online, and it's all I'm looking for! A simple GUI interface where a user can load two different files and see the similarities or differences between them!
For others looking for something similar: https://github.com/yebrahim/pydiff
From my point of view, you can take the user input and store it in two strings, say str1 and str2, then use the split() method, or rather word_tokenize() from NLTK, to get all the words in each string.
If you want, you can also remove stopwords for a better comparison.
Now you can run a loop comparing each word, and for clearer presentation you can underline the words, or the particular part of a word, that doesn't match.
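A sketch of that loop, using plain split() (word_tokenize would work the same way) and an illustrative stop-word set:

str1 = "the quick brown fox jumps over the lazy dog"
str2 = "the quick red fox leaps over the lazy dog"

stop_words = {"the", "a", "an", "over"}  # illustrative; NLTK has a fuller list

words1 = [w for w in str1.split() if w not in stop_words]
words2 = [w for w in str2.split() if w not in stop_words]

# compare position by position and report mismatches
for w1, w2 in zip(words1, words2):
    if w1 != w2:
        print(w1, "!=", w2)
# brown != red
# jumps != leaps

For Git-style diffs that also handle insertions and deletions, the standard library's difflib.SequenceMatcher is the usual tool.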
I'm working with a corpus I've scraped from Twitter activist communities in order to study the modern era of community organizing. I'm trying to run these data through re.findall in order to identify the tweets focused on location. I think that using the keyword "at" may be the easiest way to accomplish this.
Basically, if the entire tweet is (for example) "all who wish 2 join, meet at city hall 3pm", my code should print out something like "meet at city hall" for that line. Is this possible, or am I fundamentally misunderstanding the utility of regex? I've only ever really used them for extracting email information previously, so I'm used to writing code like this:
match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
However, attempting to exchange the '@' in the code above for an 'at' doesn't yield any results.
I'm probably not even asking the right question here. Apologies for any confusion I cause and I appreciate any and all help!
If I understand correctly, you are just trying to match a sentence containing the word "at" or an "@"?
This is the regex I came up with:
r'[\w\s]+(?:\bat\b|@)[\w\s]+'
This will match any words before and after an "at" or "@". The non-capturing group (?:...) makes re.findall return the whole match instead of just the separator, and the word boundaries keep it from matching the "at" inside words like "what".
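For example, on the tweet from the question:

import re

line = "all who wish 2 join, meet at city hall 3pm"
match = re.findall(r'[\w\s]+(?:\bat\b|@)[\w\s]+', line)
print(match)
# [' meet at city hall 3pm']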
For future reference: next time you are creating a regex, use https://regex101.com/. I find it helps a ton.
I am using the spaCy library to build a chatbot. How do I check if a document is a question with a certain confidence? I know how to do relevance, but I'm not sure how to filter statements from questions.
I am looking for something like below:
spacy.load('en_core_web_lg')('Is this a question?').is_question
My first response is to suggest looking for question marks at the end of the sentence.
Otherwise, most questions start with {is, does, do, what, when, where, who, why, how}.
There is a more complex answer involving the inclusion of auxiliary verbs and their placement relative to the verb, but if your data is well-formed, this may be sufficient (and fast).
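A sketch of that heuristic (the word list is illustrative, not exhaustive):

QUESTION_WORDS = {"is", "are", "does", "do", "what", "when",
                  "where", "who", "why", "how", "can", "could", "would"}

def is_question(text):
    words = text.strip().lower().split()
    if not words:
        return False
    # a trailing question mark is the strongest signal
    if words[-1].endswith("?"):
        return True
    # otherwise check whether the sentence opens with a question word
    return words[0] in QUESTION_WORDS

print(is_question("Is this a question?"))   # True
print(is_question("This is a statement."))  # False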
How can I pick tags from an article or a user's post using Python?
Is the following method ok?
Build a list of word frequencies from the text and sort it.
Remove some common words and pick the top 10 remaining words as the tags.
If the above method is OK, what library can detect which words are common, like "the, if, you, etc.", and which are descriptive words?
Here's an article on removing stop words. The link to the stop word list in the article is broken but here's another one.
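A minimal sketch of the method from the question, using collections.Counter and an illustrative stop-word set:

import re
from collections import Counter

stop_words = {"the", "if", "you", "a", "an", "and", "of", "to", "is", "in"}  # illustrative

def suggest_tags(text, n=10):
    words = re.findall(r'\w+', text.lower())
    # drop common words, count the rest, keep the n most frequent
    counts = Counter(w for w in words if w not in stop_words)
    return [word for word, _ in counts.most_common(n)]

print(suggest_tags("the quick brown fox and the lazy dog and the quick cat"))
# ['quick', 'brown', 'fox', 'lazy', 'dog', 'cat']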
The Natural Language Toolkit offers a broad variety of methods for this kind of stuff. I can't give you hands-on advice as I'm not familiar with this subject, but I think it's worth the effort to read a few articles about this topic before you start: just picking words from the text directly won't get you very far, I think. You should probably try to find words similar to the ones for which tags already exist. And of course you need to filter out the common words of the language like "the" and such. Again, this Python library can help you with that, at least for a few common languages.
I'd suggest you download the Stack Overflow data dump. There you get a lot of real world posts, with appropriate tags, to test different algorithms of tag selection.
But generally I doubt it will work too well. For your own question "words" is the clear winner in word count, followed by a list of words with two appearances each, like "common", "list", "method", "pick" and "tags". Which of those would you automatically choose as tags? Also the tags you chose manually contain "python" and "context", none of which shows up with high word frequency.
Train a Bayes or Fisher filter with already-tagged data (e.g. with the Stack Overflow data dump suggested by sth) and use it to classify new posts. I'd recommend reading the excellent book Programming Collective Intelligence by Toby Segaran for more information and Python examples on this topic.
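The answer doesn't name a library; here's a sketch with scikit-learn's naive Bayes, where the toy training pairs stand in for real tagged posts:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy training set: (post text, tag) pairs stand in for the real data dump
posts = ["how do I read a file in python",
         "css selector not matching div",
         "python list comprehension syntax"]
tags = ["python", "css", "python"]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(posts), tags)

new_post = ["iterating over a dict in python"]
print(classifier.predict(vectorizer.transform(new_post)))  # ['python']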
Instead of blacklisting words that shouldn't be tags, why don't you instead build a whitelist of words that would make for good tags?
Start with a handful of tags that you would like to have, like Python, off-topic, football, rickroll or whatnot (depends on the kind of site you are building!), have the system only suggest among those, then let users handpick appropriate tags and also let them type in their own tags.
When enough users suggest a tag, it gets into the pool of "known good" tags for auto suggestion -- maybe after some sort of moderation, so that you can still blacklist stupid tags like the, lolol, or typoed tags like objectoriented when you have object-oriented.
Only show a few suggestions. Offer autocompletion. Limit the number of tags per item. If this will be about coding, maybe some sort of language detection system (the Linux file command is not too shabby at this) will help your suggestion system.