Twython not importing only English Tweets? - python

I'm using this method exactly, but when I try to specify just English with lang="en" (and every other variation of that I could think of), it doesn't work. This is what I'm putting in (even with keywords to limit it further) and it still isn't giving me just English. I've tried with and without keywords. I'm trying to build a 200,000+ tweet searchable control corpus in English only for a research project, and I do not want to go through that many tweets by hand. Ideas?
>>> from nltk.twitter import Twitter
>>> tw = Twitter()
>>> tw.tweets(keywords='Delicacy, reptile, death, hold, dark, column, gifted, surgeon, brave, fashion, pearl, diamond, bent, sparkle, present, missing, shadow, holiday, glide, scanner, luster, immunity, devour, discipline, barbaric, fortunate, heart, puzzle, ache, crystal',
...           limit=10000, lang="en", to_screen=False)
Writing to /Users/rhiannalavalla/twitter-files/tweets.20170521-235221.json
Written 10000 Tweets

The lang option is passed to the Twitter search API, so you're requesting "English" tweets. But have you used Twitter? Authors don't have to declare the language of each and every tweet, so Twitter can't restrict your results with accuracy. The lang option evidently matches the author's choice of language for their UI, not the language of the individual tweets.
To restrict your results to tweets in English, search by hashtags and/or user ids that are likely to be of interest to English speakers only (the specifics will depend on what your corpus is for). Alternatively (or perhaps in addition), you can apply an automated language identification algorithm to filter out suspect tweets. The NLTK comes with the langid corpus of language trigram statistics, which you could use to train a recognizer.
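For instance, a minimal post-filtering sketch, assuming the tweets file written by nltk.twitter contains one tweet JSON object per line, and using the off-the-shelf langdetect package (pip install langdetect, also discussed further down this page) rather than training a recognizer from scratch:
import json
from langdetect import detect

kept = []
with open("tweets.20170521-235221.json") as infile:
    for line in infile:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        try:
            if detect(tweet["text"]) == "en":
                kept.append(tweet)
        except Exception:
            pass  # langdetect raises on very short or empty text; skip those tweets

with open("tweets_english_only.json", "w") as outfile:
    for tweet in kept:
        outfile.write(json.dumps(tweet) + "\n")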

Related

Extract Sentences from A String

I am working on a machine learning chatbot project which uses Google's speech recognition API.
Now my problem is, when I say two or more sentences in one command, the speech recognition API returns all the sentences in one string, without any full stops or commas. As a result, it has become harder to separate the sentences. For example, if I say,
Take a photo. Tell me about today's weather. Open Google Chrome.
the speech recognition API returns:
take a photo tell me about todays weather open Google Chrome
so, my chatbot takes this full string as one sentence.
Is there any way to extract sentences from a string like the one above?
(BTW, I am using Python)
If you are going to say multiple commands, say a word like "and" between them and split the command on that word. Then loop through the list and pass each value to your execute function.
If the variable command stores your value, split it using command.split(" and ")
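A short sketch of that idea, where execute_command stands in for whatever handler your chatbot already has for a single command:
def execute_command(single_command):
    # placeholder: call your chatbot's existing handler here
    print("Executing:", single_command)

command = "take a photo and tell me about todays weather and open Google Chrome"
for part in command.split(" and "):
    execute_command(part.strip())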
I previously answered a similar question; take a look at it:
https://stackoverflow.com/a/65872940/12279129
I think you could try different approaches to solve the problem:
A Naive solution
I don't know how your system works at the moment, but if you are just looking for certain sub-sentences, you could simply check whether the full string contains what you are looking for, e.g.:
input_str = "Take a photo turn on fan".lower()
if "take a photo" in input_str:
    print("Just took a photo!")
if "turn on fan" in input_str:
    print("Just turned the fan on!")
Of course you could also select a separator word (like "and", "furthermore", ...) and use it as the separator.
A more advanced solution
You could use an NLP library (e.g. spaCy) and perform part-of-speech tagging and entity recognition so that you can isolate verbs from nouns and so on.
After that you could eventually make use of stemming and lemmatization to further generalize the recognition.
You could also perform many intermediate steps with different NLP techniques, like stop word removal. A rough sketch of the spaCy idea follows.
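This is only a sketch, assuming the small English model is installed (python -m spacy download en_core_web_sm): it uses POS tags and dependency children to pull out each verb together with its direct object, which you could then match against your list of known commands.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("take a photo tell me about todays weather open google chrome")

for token in doc:
    if token.pos_ == "VERB":
        # collect the direct objects attached to this verb
        objects = [child.text for child in token.children if child.dep_ == "dobj"]
        print(token.lemma_, objects)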
Try auto punctuation from API
Maybe you can try enabling automatic punctuation in the Speech-to-Text API and see if that works well enough for you.
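A sketch of what that looks like with the official google-cloud-speech client (the bucket URI is a placeholder; if you are using a different speech library, the option name will differ):
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,  # ask the API to insert full stops and commas
)
audio = speech.RecognitionAudio(uri="gs://your-bucket/command.wav")

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)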
That's because Google Cloud Speech doesn't provide natural language understanding, so you are stuck parsing text transcripts.
You can of course create the natural language understanding component yourself, either by using simple regular expressions or using something like Rasa, but there's a smarter way, too.
Speechly provides you with everything you need to create voice user interfaces on Android, iOS or the web. It returns not only the transcript, but also actionable intents and entities, which makes it a lot easier to create something a bit more complex. The best part is that it's free for up to 20 hours a month.
You can see a very simple example of how it works, for instance for creating search experiences, here. However, the basic idea is always the same: create a model and test that it returns correct intents for your speech input. After you are done, you integrate it into your app by looping through the returned results and, whenever you get the correct intent, reacting in your application as needed. It's actually very simple.
You can use the split method.
Say your string is A:
X = A.split('.')
This will make X a list whose items are the sentences.

Python NLP: identifying the tense of a sentence using TextBlob, StanfordNLP or Google Cloud

(Note: I am aware that there have been previous posts on this question (e.g. here or here), but they are rather old and I think there has been quite some progress in NLP in the past few years.)
I am trying to determine the tense of a sentence, using natural language processing in Python.
Is there an easy-to-use package for this? If not, how would I need to implement solutions in TextBlob, StanfordNLP or Google Cloud Natural Language API?
TextBlob seems easiest to use, and I manage to get the POS tags listed, but I am not sure how I can turn the output into a 'tense prediction value' or simply a best guess on the tense. Moreover, my text is in Spanish, so I would prefer to use Google Cloud or StanfordNLP (or any other easy-to-use solution) that supports Spanish.
I have not managed to work with the Python interface for StanfordNLP.
Google Cloud Natural Language API seems to offer exactly what I need (see here), but I have not managed to find out how I would get to this output. I have used Google Cloud NLP for other analyses (e.g. entity sentiment analysis) and it has worked, so I am confident I could set it up if I find the right example of use.
Example of textblob:
from textblob import TextBlob
from textblob.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob = TextBlob("I am curious to see whether NLP is able to predict the tense of this sentence.", pos_tagger=nltk_tagger)
print(blob.pos_tags)
-> this prints the POS tags; how would I convert them into a prediction of the tense of this sentence?
Example with Google Cloud NLP (after setting up credentials):
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types
text = "I am curious to see how this works"
client = language.LanguageServiceClient()
document = types.Document(
    content=text,
    type=enums.Document.Type.PLAIN_TEXT)
tense = (WHAT NEEDS TO COME HERE?)
print(tense)
-> I am not sure about the code that needs to be entered to predict the tense (indicated in the code)
I am quite a newbie to Python so any help on this topic would be highly appreciated! Thanks!
I don't think any NLP toolkit has a function to detect past tense right away. But you can simply get it from dependency parsing and POS tagging.
Do the dependency parse of the sentence and have a look at the root, which is the main predicate of the sentence, and at its POS tag. If it is VBD (a verb in simple past form), it is surely past tense. If it is VB (base form) or VBG (a gerund), you need to check its dependency children and see whether there is an auxiliary verb (deprel aux) with the VBD tag.
If you also need to cover the present/past perfect or past modal expressions (I must have had...), you can just extend the conditions.
In spacy (my favorite NLP toolkit for Python), you can write it like this (assuming your input is a single sentence):
import spacy
nlp = spacy.load('en_core_web_sm')
def detect_past_sentence(sentence):
    sent = list(nlp(sentence).sents)[0]
    return (
        sent.root.tag_ == "VBD" or
        any(w.dep_ == "aux" and w.tag_ == "VBD" for w in sent.root.children))
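For example, calling the function on two simple sentences should give something like:
print(detect_past_sentence("I went to the cinema yesterday."))  # True: the root "went" is tagged VBD
print(detect_past_sentence("I am going to the cinema."))        # False: no VBD root or auxiliary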
With the Google Cloud API or StanfordNLP, it would be basically the same; I am just not as familiar with those APIs.

Twitter API - Obtain user tweets and parse into a table/database

This is a small project I'd like to get started on in the near future. It's still in the planning stage, so this post is more about being steered in the right direction.
Essentially, I'd like to obtain tweets from a user and parse the tweets into a table/database, with the aim to be able to run this program in real-time.
My initial plan to tackle this was to use Beautiful Soup, a Python-specific library; however, I believe the Twitter API is the better approach (advice on this subject would be appreciated).
There are still 3 unknowns:
Where do I store the tweets once obtained?
How to parse the tweets?
Where to store the parsed data?
To answer (3), I suppose it depends on what I want to do with the data. I still haven't decided how I'll use the parsed data, but I know that I'd like it put into categories, so my thinking is probably a database/table/Excel?
A few questions still to answer and I'd like you guys to steer me in the right direction. My programming language knowledge is limited to just C for now, but as this project means a great deal to me, I'm willing to put the effort in and learn the necessary languages/APIs.
What languages/APIs will I need to gain an understanding of to accomplish this project? From where I stand, it seems to be Twitter API and Python.
EDIT: So I have a basic script going which obtains a user's tweets. It works better than expected. However, I'd like to take it another step: I'd like to obtain the user's tweets only if a tweet contains a hashtag. All other tweets should be ignored. How best to do this?
Here is a snippet of the basic code I have going:
import tweepy
import twitter_credentials
auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)
stuff = api.user_timeline(screen_name = 'XXXXXXXXXX', count = 10, include_rts = False)
for status in stuff:
    print(status.text)
Scraping Twitter (or any other social network) with, for example, Beautiful Soup is, as you said, not a good idea for two reasons:
if the source page changes (name attributes, div ids...), you have to keep your code up to date
your script can be banned, because scraping is not "allowed".
To answer your questions :
1) you can store the tweets wherever you want : csv, mysql, sqlite, redis, neo4j...
2) With the official API, you get JSON. Here is a Tweet Object: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html . With tweepy, for example, status.text will give you the text of the tweet.
3) Same as #1. If you don't actually know yet what you will do with the data, store the full JSONs. You will be able to parse them later.
I suggest tweepy/Python (http://www.tweepy.org/) or twit/Node.js (https://www.npmjs.com/package/twit). And read the official docs: https://developer.twitter.com/en/docs/api-reference-index. For the hashtag filter in your EDIT, see the sketch below.
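One way to do that hashtag filtering, reusing the question's own tweepy setup (a sketch only): each status carries an entities dict whose 'hashtags' list is empty when the tweet has none, so you can keep just the tweets where that list is non-empty.
import tweepy
import twitter_credentials

auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

for status in api.user_timeline(screen_name='XXXXXXXXXX', count=10, include_rts=False):
    if status.entities.get('hashtags'):  # skip tweets without any hashtag
        print(status.text)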

Automatically pick tags from context using Python

How can I pick tags from an article or a user's post using Python?
Is the following method ok?
Build a list of word frequency from the text and sort them.
Remove some common words and pick the top 10 words remained in the list as the tags.
If the above method is OK, what library can detect which words are common, like "the, if, you, etc.", and which are descriptive words?
Here's an article on removing stop words. The link to the stop word list in the article is broken but here's another one.
The Natural Language Toolkit offers a broad variety of methods for this kind of thing. I can't give you hands-on advice as I'm not familiar with this subject, but I think it's worth the effort to read a few articles about the topic before you start: just picking words from the text directly won't get you very far, I think; you should probably try to find words similar to the ones for which tags already exist. And of course you need to filter out the common words of the language, like "the" and so on. Again, this Python library can help you with that, at least for a few common languages. A sketch of the frequency-plus-stop-words idea follows.
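A minimal sketch of the frequency-based approach from the question, using NLTK's stop word list to drop the common words (it assumes you have run nltk.download('stopwords') and nltk.download('punkt') once):
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "How can I pick tags from an article or a user's post using Python?"
stop = set(stopwords.words('english'))
words = [w.lower() for w in word_tokenize(text)
         if w.isalpha() and w.lower() not in stop]
print(Counter(words).most_common(10))  # top-10 candidate tags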
I'd suggest you download the Stack Overflow data dump. There you get a lot of real world posts, with appropriate tags, to test different algorithms of tag selection.
But generally I doubt it will work too well. For your own question, "words" is the clear winner in word count, followed by a list of words with two appearances each, like "common", "list", "method", "pick" and "tags". Which of those would you automatically choose as tags? Also, the tags you chose manually contain "python" and "context", neither of which shows up with high word frequency.
Train a Bayes or Fisher filter with already-tagged data (e.g. the Stack Overflow data dump suggested by sth) and use it to classify new posts. I'd recommend reading the excellent book Programming Collective Intelligence by Toby Segaran for more information and Python examples on this topic.
Instead of blacklisting words that shouldn't be tags, why don't you instead build a whitelist of words that would make for good tags?
Start with a handful of tags that you would like to have, like Python, off-topic, football, rickroll or whatnot (it depends on the kind of site you are building!), and have the system suggest only those; then let users handpick appropriate tags and also let them type in their own tags.
When enough users suggest a tag, it gets into the pool of "known good" tags for auto suggestion -- maybe after some sort of moderation, so that you can still blacklist stupid tags like the, lolol, or typoed tags like objectoriented when you have object-oriented.
Only show a few suggestions. Offer autocompletion. Limit the number of tags per item. If this will be about coding, maybe some sort of language detection system (the Linux file command is not too shabby at this) will help your suggestion system.

Automatically determine the natural language of a website page given its URL

I'm looking for a way to automatically determine the natural language used by a website page, given its URL.
In Python, a function like:
def LanguageUsed(url):
    # stuff
which returns a language specifier (e.g. 'en' for English, 'jp' for Japanese, etc.)
Summary of Results:
I have a reasonable solution working in Python, using code from PyPI for oice.langdet.
It does a decent job of discriminating English vs. non-English, which is all I require at the moment. Note that you have to fetch the HTML using Python urllib. Also, oice.langdet is GPL-licensed.
For a more general solution using Trigrams in Python as others have suggested, see this Python Cookbook Recipe from ActiveState.
The Google Natural Language Detection API works very well (if not the best I've seen). However, it is JavaScript and their TOS forbids automating its use.
This is usually accomplished by using character n-gram models. You can find a state-of-the-art language identifier for Java here. If you need some help converting it to Python, just ask. Hope it helps.
Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page language, with a probability index.
See http://code.google.com/apis/ajaxlanguage/documentation/
There is nothing about the URL itself that will indicate language.
One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you can get the NLP part of it working, it'll be pretty slow. Also, it may not be reliable. Remember, most user agents pass something like
Accept-Language: en-US
with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.
You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.
You might want to try ngram based detection.
TextCat DEMO (LGPL) seems to work pretty well (it recognizes almost 70 languages). There is a Python port provided by Thomas Mangin here, using the same corpus.
Edit: TextCat competitors page provides some interesting links too.
Edit2: I wonder if making a python wrapper for http://www.mnogosearch.org/guesser/ would be difficult...
nltk might help (if you have to get down to dealing with the page's text, i.e. if the headers and the URL itself don't determine the language sufficiently well for your purposes). I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it might in fact have it), but you can try parsing the given text according to various possible natural languages and checking which ones give the most sensible parse, word set, etc., according to the rules for each language.
There's no general method that will work solely on URLs. You can check the top-level domain to get some idea, and look for portions of the URL that might be indicative of a language (like "en" or "es" between two slashes), and assume anything unknown is in English, but it isn't a perfect solution.
So far as I know, the only general way to determine the natural language used by a page is to grab the page's text and check for certain common words in each language. For example, if "a", "an" and "the" appear several times on the page, it's likely that it contains English text; "el" and "la" might suggest Spanish; and so on. A toy sketch of this idea is below.
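A toy version of the common-word approach (the word lists here are illustrative only; a real implementation would use much larger lists or proper n-gram statistics):
COMMON_WORDS = {
    'en': {'the', 'a', 'an', 'and', 'is', 'of'},
    'es': {'el', 'la', 'los', 'las', 'y', 'de'},
}

def guess_language(text):
    words = text.lower().split()
    # count how many known function words of each language appear in the text
    scores = {lang: sum(w in vocab for w in words) for lang, vocab in COMMON_WORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("El perro y el gato"))  # 'es'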
In Python, the langdetect package (found here) can do this.
It is based on Google's automatic language detection and supports 55 languages by default.
It is installed by using
pip install langdetect
And then for example running
from langdetect import detect
detect("War doesn't show who's right, just who's left.")
detect("Ein, zwei, drei, vier")
Will return 'en' and 'de' respectively.
