how to get English tweets alone using python? - python

Here is my current code
from twitter import *
t = Twitter(auth=OAuth(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET,
                       ACCESS_TOKEN, ACCESS_TOKEN_SECRET))
t.statuses.home_timeline()
query=raw_input("enter the query \n")
data = t.search.tweets(q=query)
for i in range (0,1000):
    print data['statuses'][i]['text']
    print '\n'
Here, I fetch tweets from all the languages. Is there a way to restrict myself to fetching tweets only in English?

There are at least four ways. I list them in order of simplicity.
1. After you collect the tweets, the JSON output has a key/value pair that identifies the language. So you can fetch tweets in all languages and keep only the ones that Twitter has tagged as English:
# iterate over the statuses actually returned instead of a fixed range
for status in data['statuses']:
    if status[u'lang'] == u'en':
        print status['text']
        print '\n'
2. To collect only tweets identified as English in the first place, use the optional 'lang' parameter to request only English (self-identified) tweets from the API. See details here. If you are using the python-twitter library, you can set the 'lang' parameter in twitter.py.
3. Use a language recognition package like guess-language.
4. If you want to recognize English text without relying on Twitter's self-identified language data (e.g. a Chinese account that is writing in English), you have to do natural language processing. One such option recognizes common English words and then marks the text as English.
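The first approach can be tried without any network access. The following is a minimal sketch, assuming each status is a dict shaped like the search results above, with 'lang' and 'text' keys. The helper name english_only and the sample data are made up for illustration.

```python
def english_only(statuses, lang='en'):
    """Keep only statuses whose 'lang' field equals the given code."""
    # .get() avoids a KeyError if a status has no 'lang' field at all.
    return [s for s in statuses if s.get('lang') == lang]


# Made-up sample data standing in for data['statuses'].
sample = [
    {'lang': 'en', 'text': 'hello world'},
    {'lang': 'fa', 'text': 'salam'},
    {'text': 'no language field'},
]

for status in english_only(sample):
    print(status['text'])
```

Filtering after the fetch still spends the request quota on tweets that get thrown away, so the 'lang' request parameter is usually the better first choice when it is available.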

I tried this for Farsi:
import tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
res = api.search('lang','fa')
for i in res:
    print(i.lang)

Related

Is it possible to set multiple strings in query for search method of tweepy? python

What I want is to search tweets that have multiple words I choose on twitter with python.
The official doc does not say anything about it, but it seems that the search method takes only one query.
source code
import tweepy
CK=
CS=
AT=
AS=
auth = tweepy.OAuthHandler(CK, CS)
auth.set_access_token(AT, AS)
api = tweepy.API(auth)
for status in api.search(q='word',count=100,): # I want to set multiple words in q but when I do.
    print(status.user.id)
    print(status.user.screen_name)
    print(status.user.name)
    print(status.text)
    print(status.created_at)
What I have tried is below. It did not raise any error, but it searched only with the last word in the query: in this case, the results were only tweets with the word "Python", not tweets with both words.
for status in api.search(q='Java' and 'Python',count=100,):
Official doc
https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets
So my question is: is it possible to set multiple words in the query?
Is the way I wrote it simply wrong?
If so, please let me know.
If it cannot take multiple words, I would appreciate it if you could share simple Python code that does what I want.
Thank you in advance.
Use:
for status in api.search(q='Java Python', count=100):
From the Standard search operators section of Search Tweets: Standard v1.1:
watching now - containing both “watching” and “now”. This is the default operator.
As explained by Vlad Siv, just put all the words you wish to look for, separated by spaces, inside the quotation marks of the query parameter. The search should then look for tweets containing all of these words.
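The reason the original attempt searched only for "Python" is plain Python semantics, not anything Tweepy does. The sketch below shows that, plus a small helper for building the space-separated query. The helper name build_query is made up for illustration.

```python
# In Python, `a and b` returns b when a is truthy, so the expression
# below is just the string 'Python'; 'Java' never reaches the API.
print('Java' and 'Python')


def build_query(words):
    """Join search words with spaces, the default "contains both" operator."""
    return ' '.join(words)


print(build_query(['Java', 'Python']))
```

The result of build_query can then be passed as the q argument of the search call shown above.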

Retrieving the tweets related to a specific search between two dates

I'm struggling to retrieve the tweets associated with a particular search between two dates. I looked at the answer here and used it as shown below, but, as that answer mentions, the code only works for tweets from the last 10-14 days. Since I need tweets from 2014, tweets ends up as an empty list.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
tweets = []
company_name = '#' + 'Apple'
date_strng = " since:2014-10-11 until:2015-10-14"
for tweet in tweepy.Cursor(api.search,q=company_name + date_strng,count=10000,lang="en").items():
    tweets.append(tweet)
I also tried the following, but it didn't work (tweets is again an empty list). If I remove the until argument, however, I get the tweets since the start_date:
start_date = datetime.datetime(2014,10,11)
end_date = datetime.datetime(2015,10,14)
for tweet in tweepy.Cursor(api.search,q=company_name,count=10000,lang="en", since=start_date,until=end_date).items():
    tweets.append(tweet)
Was wondering if there is a solution to this.
Thanks
The list is empty because the standard search API retrieves only the last 7 days of tweets. Since you have given since and until dates from 2014 and 2015, the date filter excludes everything the API can return, so the list is empty.
Refer to the link below for retrieving old tweets:
https://stackoverflow.com/a/61737450/10703097
Also, you are requesting a full year of tweets, which is a huge corpus; try to narrow the range to what you actually need.
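A quick local check can tell you up front whether a date range has any chance of returning results. This is only a sketch, assuming the 7-day window stated above. The function name overlaps_search_window and the fixed "today" are made up for illustration, and the exact cutoff the API applies may differ.

```python
import datetime


def overlaps_search_window(since, until, today, window_days=7):
    """Return True if [since, until] overlaps the last `window_days` days."""
    window_start = today - datetime.timedelta(days=window_days)
    # The ranges overlap only if the request ends on or after the window
    # start and begins on or before today.
    return until >= window_start and since <= today


# Hypothetical "today", fixed so the example is reproducible.
today = datetime.date(2020, 5, 12)
print(overlaps_search_window(datetime.date(2014, 10, 11),
                             datetime.date(2015, 10, 14), today))
print(overlaps_search_window(datetime.date(2020, 5, 8),
                             datetime.date(2020, 5, 12), today))
```

The first call covers the range from the question and reports no overlap, which is consistent with the empty list.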

Twitter API: Get top tweets by query and WOEID place

Preferably via Tweepy in Python, I want to obtain from the Twitter API a list of top tweets for a given search query and WOEID place identifier (Yahoo's Where On Earth IDentifier).
In my example, I obtain trending queries for a WOEID id via Tweepy's API.trends_place(id) wrapper for the Twitter REST API's GET trends/place; I then want to print the top tweets for each trending query within this place (same WOEID).
Currently, I obtain tweets for the trending query, but they are
- not within the given place;
- not necessarily the "top" tweets (as opposed to, for example, "recent").
How can I add these two restrictions to my search?
MWE:
import tweepy
from tweepy import OAuthHandler
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
locationid = 23424775 # WOEID for Canada
trendqueries = [trend['query'] for trend in api.trends_place(locationid)[0]['trends']]
for trendquery in trendqueries:
    print(api.search(q=trendquery))
What I have tried:
I can search by longitude/latitude using Tweepy's API.search(q, geocode), but I do not see an obvious way to search by WOEID.
Partial answer
API.search(q[, lang][, locale][, rpp][, page][, since_id][, geocode][, show_user])
Returns tweets that match a specified query.
Parameters:
geocode – Returns tweets by users located within a given radius of the given latitude/longitude. The location is preferentially taken from the Geotagging API, but will fall back to their Twitter profile. The parameter value is specified by “latitude,longitude,radius”, where radius units must be specified as either “mi” (miles) or “km” (kilometers). Note that you cannot use the near operator via the API to geocode arbitrary locations; however you can use this geocode parameter to search near geocodes directly.
show_user – When true, prepends “<user>:” to the beginning of the tweet. This is useful for readers that do not display Atom’s author field. The default is false.
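Since there is no WOEID argument, one workaround is to approximate the place with a point and a radius. The sketch below only builds the geocode string in the format described above. The helper name geocode_param is made up, and the coordinates are illustrative values (roughly central Ottawa), not something derived from the WOEID.

```python
def geocode_param(latitude, longitude, radius, unit='km'):
    """Build a "latitude,longitude,radius" string with a unit suffix."""
    if unit not in ('mi', 'km'):
        raise ValueError("unit must be 'mi' or 'km'")
    return '{},{},{}{}'.format(latitude, longitude, radius, unit)


# Illustrative values only: roughly central Ottawa, 50 km radius.
geocode = geocode_param(45.4215, -75.6972, 50)
print(geocode)
# With an authenticated Tweepy client this might be passed as:
# api.search(q=trendquery, geocode=geocode)
```

A single circle is a crude stand-in for a whole country, so this probably fits city-level WOEIDs better than the Canada example.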

Twitter API Streaming by Locations

I'm using TwitterAPI, a Python implementation of the Twitter API.
I'm trying to get tweets from a specific city (São Paulo). On the Twitter Advanced Search website (https://twitter.com/search-advanced) this is easy, but when I try to do it using streaming, it never returns any tweets. (I know Advanced Search is completely different from the Twitter streaming API.)
Following the documentation, I pass the southwest coordinate first and the northeast one after it.
https://dev.twitter.com/streaming/overview/request-parameters#locations
#!/usr/bin/python
import pprint
from TwitterAPI import TwitterAPI
pp = pprint.PrettyPrinter(depth=6)
api = TwitterAPI(CONSUMER_KEY,
                 CONSUMER_SECRET,
                 ACCESS_TOKEN_KEY,
                 ACCESS_TOKEN_SECRET)
r = api.request('statuses/filter', {'locations':'-23.984524,-46.885064,-23.393466,-46.479943'})
for item in r:
    pp.pprint(item)
But I never get any tweets. What am I doing wrong?
You have the latitudes and longitudes reversed. Try:
r = api.request('statuses/filter', {'locations':'-46.885064,-23.984524,-46.479943,-23.393466'})
The locations parameter takes longitude/latitude pairs.
Does using locations as a list of float values help?
{'locations':[-46.885064,-23.984524,-46.479943,-23.393466]}
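To avoid swapping the order by hand, the string can be built from (latitude, longitude) corners. This is a minimal sketch; the helper name bounding_box is made up for illustration, and it assumes the longitude-first, southwest-first order described in the first answer.

```python
def bounding_box(southwest, northeast):
    """Format two (latitude, longitude) corners as 'lon,lat,lon,lat'."""
    sw_lat, sw_lon = southwest
    ne_lat, ne_lon = northeast
    # Longitude comes first in each pair, southwest corner first.
    return '{},{},{},{}'.format(sw_lon, sw_lat, ne_lon, ne_lat)


# The corners from the question, given as (latitude, longitude).
locations = bounding_box((-23.984524, -46.885064), (-23.393466, -46.479943))
print(locations)
```

The resulting string matches the corrected locations value in the answer above.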

Logical Operators in Tweepy Filter

I'm hoping to track tweets that contain a certain set of words, but not others. For example, suppose my filter is "taco" AND ("chicken" OR "beef").
It should return these tweets:
-I am eating a chicken taco.
-I am eating a beef taco.
It should not return these tweets:
-I am eating a taco.
-I am eating a pork taco.
Here is the code I'm currently running:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import json
# authentication data- get this info from twitter after you create your application
ckey = '...' # consumer key, AKA API key
csecret = '...' # consumer secret, AKA API secret
atoken = '...' # access token
asecret = '...' # access secret
# define listener class
class listener(StreamListener):
    def on_data(self, data):
        try:
            print data # write the whole tweet to terminal
            return True
        except BaseException, e:
            print 'failed on data, ', str(e) # if there is an error, show what it is
            time.sleep(5) # one error could be that you're rate-limited; this will cause the script to pause for 5 seconds
    def on_error(self, status):
        print status
# authenticate yourself
auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["taco"]) # track what you want to search for!
The last line of the code is the part I'm struggling with; if I use:
twitterStream.filter(track=["taco","chicken","beef"])
it will return all tweets containing any of the three words. Other things I've tried, such as:
twitterStream.filter(track=(["taco"&&("chicken","beef")])
return a syntax error.
I'm fairly new to both Python and Tweepy. Both this and this seem like similar queries, but they are related to tracking multiple terms simultaneously, rather than tracking a subset of tweets containing a term. I haven't been able to find anything in the tweepy documentation.
I know another option would be tracking all tweets containing "taco" then filtering by "chicken" or "beef" into my database, but I'm worried about running up against the 1% streaming rate limit if I do a general search and then filter it down within Python, so I'd prefer only streaming the terms I want in the first place from Twitter.
Thanks in advance-
Sam
Twitter does not allow you to be very precise in how keywords are matched. However, the track parameter documentation states that spaces within a keyword phrase are equivalent to logical ANDs, and that the comma-separated phrases you specify are OR'd together.
So, to achieve your "taco" AND ("chicken" OR "beef") example, you could try track=["taco chicken", "taco beef"]. This would match tweets containing the words taco and chicken, or taco and beef. However, this isn't a perfect solution, as a tweet containing taco, chicken, and beef would also be matched.
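For longer lists of alternatives, the phrases can be generated instead of typed out. This is a minimal sketch of that expansion; the helper name expand_track is made up for illustration.

```python
def expand_track(required, alternatives):
    """Expand required AND (alt1 OR alt2 ...) into track phrases."""
    # A space inside a phrase acts as AND; separate phrases are OR'd.
    return ['{} {}'.format(required, alt) for alt in alternatives]


track = expand_track('taco', ['chicken', 'beef'])
print(track)
# With the stream from the question this might be used as:
# twitterStream.filter(track=track)
```

Because the matching happens on Twitter's side, only tweets that satisfy one of the phrases count against the stream, which addresses the rate-limit worry raised in the question.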
