Extract tweets that match a string of words exactly - python

I am trying to get tweets that have an exact match to a string.
Here is the code:
query = "last dance"
language="en"
results = api.search(q=query, lang=language, count=200)
However, the results include tweets that contain the words "last" and "dance" separately. I want only tweets that contain the exact phrase "last dance".

To match an exact phrase, wrap it in double quotes inside the query, for example query = '"last dance"'. The Twitter documentation on standard search operators adds:
Please make sure to URL encode these queries before making the request. There are several online tools to help you do that, or you can search at twitter.com/search and copy the encoded URL from the browser's address bar. The table below shows some example mappings from search queries to URL encoded queries:
Search query          URL encoded query
#haiku #poetry        %23haiku+%23poetry
"happy hour" :)       %22happy%20hour%22%20%3A%29
Note that the space character can be represented by "%20" or the "+" sign.
For more information, see https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators
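As a rough illustration of the encoding described above, here is a minimal sketch using only the standard library. The helper name exact_phrase_query is hypothetical. As far as I know, tweepy encodes the q parameter itself, so the manual encoding shown here should only matter when you build the request URL by hand.

```python
from urllib.parse import quote, quote_plus


def exact_phrase_query(phrase: str) -> str:
    """Wrap a phrase in double quotes so search treats it as one exact phrase."""
    return '"{}"'.format(phrase)


query = exact_phrase_query("last dance")
print(query)              # "last dance"

# Two equivalent ways to represent the space, as noted in the table above.
print(quote(query))       # %22last%20dance%22
print(quote_plus(query))  # %22last+dance%22
```

With tweepy, the unencoded string would presumably be passed directly, e.g. api.search(q=query, lang=language, count=200).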

Related

Is it possible to set multiple strings in query for search method of tweepy? python

I want to search Twitter with Python for tweets that contain several words of my choosing.
The official doc does not say anything about this, but it seems that the search method takes only one query.
Source code:
import tweepy

# Fill in your own credentials.
CK = "..."  # consumer key
CS = "..."  # consumer secret
AT = "..."  # access token
AS = "..."  # access token secret

auth = tweepy.OAuthHandler(CK, CS)
auth.set_access_token(AT, AS)
api = tweepy.API(auth)

# I want to set multiple words in q, but it does not work when I do.
for status in api.search(q='word', count=100):
    print(status.user.id)
    print(status.user.screen_name)
    print(status.user.name)
    print(status.text)
    print(status.created_at)
What I have tried is below. It did not raise any error, but it searched only with the last word in the query: in this case, the results were only tweets containing the word "Python", not tweets containing both words.
for status in api.search(q='Java' and 'Python', count=100):
Official doc
https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/api-reference/get-search-tweets
So my question is: is it possible to set multiple words in the query?
Is the way I wrote it simply wrong?
If so, please let me know.
If it cannot take multiple words, I would appreciate it if you could share simple Python code that does what I want.
Thank you in advance.
Use:
for status in api.search(q='Java Python', count=100):
From the Search Tweets: Standard v1.1 section Standard search operators:
watching now - containing both “watching” and “now”. This is the default operator.
As explained by Vlad Siv, just put every word you wish to look for inside the same quoted string for the query param, separated by spaces. The search should then look for tweets containing all of these words.
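The reason q='Java' and 'Python' searched only for "Python" is plain Python semantics: the and operator returns its last operand when every operand is truthy, so the expression is evaluated before tweepy ever sees it. A small sketch of that behaviour, and of building the query string from a list of words (the variable names are only illustrative):

```python
# `and` returns its last operand when all operands are truthy,
# so this is simply the string 'Python'.
q_wrong = 'Java' and 'Python'
print(q_wrong)  # Python

words = ['Java', 'Python']

# Space-separated words: tweets containing both words (the default operator).
q_all = ' '.join(words)
print(q_all)  # Java Python

# The standard search operators also document OR for "either word".
q_any = ' OR '.join(words)
print(q_any)  # Java OR Python
```

Either string can then be passed as q, e.g. api.search(q=q_all, count=100).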

Excluding link at the end while pulling tweets in tweepy Streaming

I am pulling text or extended_text using tweepy streaming, but when I pull these tweets, there is always a t.co/randomletters link at the end that leads to nowhere. What is it and how do I get rid of it?
Here is an example:
"text": "To make room for more expression, we will now count all emojis as equal—including those with gender‍‍‍ ‍‍and skin tone modifiers https://t.co(forward slash)MkGjXf9aXm"
Please help
As far as my experience with Twitter and tweepy goes, these URLs are included in a tweet's text whenever the actual tweet contains a URL of some sort, so we cannot really avoid getting them.
You could remove them after you get them. This simple regex replaces URLs of this pattern with an empty string:
import re
tweet_text = re.sub(r' https://t\.co/\w{10}', '', tweet_text)
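The regex approach above can be wrapped in a small helper. This is only a sketch: the function name strip_tco_links is hypothetical, and the pattern is slightly looser than the one in the answer (it accepts http or https and any run of word characters after t.co/), which is an assumption about how these links look rather than a documented format.

```python
import re

# Leading whitespace is consumed too, so no trailing space is left behind.
TCO_LINK = re.compile(r'\s*https?://t\.co/\w+')


def strip_tco_links(text: str) -> str:
    """Remove every t.co short link from a tweet's text."""
    return TCO_LINK.sub('', text)


sample = "We will now count all emojis as equal https://t.co/MkGjXf9aXm"
print(strip_tco_links(sample))  # We will now count all emojis as equal
```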

Extracting URLs of certain pattern from text

I have the following URLs that need to be fetched from a page. This is the attempt I made, but it skips part of the string if it contains =. The different kinds of URLs are given below:
AjaxRender.htm?encparams=2~2586506573108327708~9SpSI_aPBiryk3VIKwmkjN-FD4jkS5GoDobsBCN6pRnZOhBsmrEOgT8vg5KciKjOmt25k3kEDZ00r7f48bIsRPSZWTHJbSCpS815cCNyQrsobBzLZlao8ww-rWwg0lLDIb10gJ3vWUl3zLIAQi5vBGLglJKXcSEg7wCXZUEm5aVHCQiGChz5f8oeiBtPXAV_A9XQ7xU5HUzyzTzyEMJICw==&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=7~502085760588479881~-8dtDO_8-jpTBfqALerDcDLkIIRWnom8BG9WdtIVgqGTlRDn37waNvbaM_VHLrcntsGabZPzMiFlNxsrmqx4VpCZrtJmjyCcOBr9AY1B2GxnTlh3ngYfIYbhnDi-W6Hpb8V77OSS-WviMKsgF87gcWvjGzEd02a7Q_3XQ2FvdZ2rvwDlwG4izypuO7Ob63Gh&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=5~6917684668841666406~70K0Ijfg4OhPeKlzLP8aQV4JjBq9WuXnpC3enGYXfyWj5-28RyHRnjGRJypZBi0knr3io-9UdjdlOWuLqisI_pkZ0hQzFA5bhlRkX7siC6uMUA6A_MntiLDNGTrKN47TvrAxRd_JpQQUprReVHYwSdUEQvVUtpKn1_Ku5WG_zyWe_0Sd7FLftU1ti6pYf_tfMyNiDalQzyrPDQ35sAXcYIDyhSYI08uZCmTq5vrjSNkQChnMSW73MKri42rVM3JVP8j5LfCf3Zrws54M8KkFRnvfsyYeYd-hATgywsv9i2rtU3A-KPP6lSrL6jqbkAXVTezFRYV00ZNUhvX8NrL8Ew==&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=2~3180584448022130058~v_d5bPfBCJINSmPxaUaByy3S5n5h5UbQ53k5QKhqYbz7KXeHku95HjcqE2MnU18rRhcdnBghBW90u-GS3tqZc5FBGt6Z9-mNBnr6RPTiAlIdlG9vO8QDW7e7vMS5H2Yue3sRQ5ANzNKGoAXe3Z5GpC1HWW9DA55OGRkLRsGdNRbN3VkqiVpObCQNGHyDYhfrh_WF8uPpAb5WE2s9sDvrSVDkUfuvHclAarXoua9OYsDQtYaDGxaquDkZrIO-VEYgjv-CPKwCkOQyOVqdq--QQ-GQNvi8vHk05uoiU6-9Kg4=&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=5~7279828915011575224~9KhHzCPV9FXMYfGNPF7W0MNL_4Ljv3YFdCr_JVtQN0GhUhD7ohGtUTYCzRJvS4sI6uyoM3TTrNmHaMsidk_BiN9qXRKpdEhJHGgfHHLzU1vtAXejIwnQUxB5Oexjkt74WeBnEfVSrxVfvhRM3LoB076SYiK5x92bA8WqJg62YtsUWV7vqtsCpvKyn9ssF7nnjlTmUqIWpBkqC9ZtcfN7-A==&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=7~502085760588479881~-8dtDO_8-jpTBfqALerDcDLkIIRWnom8BG9WdtIVgqGTlRDn37waNvbaM_VHLrcntsGabZPzMiFlNxsrmqx4VpCZrtJmjyCcOBr9AY1B2GxnTlh3ngYfIYbhnDi-W6Hpb8V77OSS-WviMKsgF87gcWvjGzEd02a7Q_3XQ2FvdZ2rvwDlwG4izypuO7Ob63Gh&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=3~4781276045400603393~duZpRpWJA0naDjmpXNSp__ILjEXoOrwiv9SVBUjldBK4ebRdYWlzxwRudeyHrXoCC-XM_xEKr475_ViwwaHlnqFgEqteM3N6bDAgOxWEc8Y5Klh5d3Ivb_6qG6VsfMmp8oaT3nLnuALjX8vfqBN72WsNlwWeGMR3lOTuQnHgbl2betlejT6KsRx7ycVv71mxe8BP7oDIdI29Baetjlv1YA==&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=6~7112793196313446100~IVBMr0jpuDOH9HKclY47FtAJQXrgqOsD6P7mbOwJOcbDWAbviVmEg1HZScYqiKL5svd6BGA7jm22V6uEvquNb_-cZEyfDIFGbNxF3WNTwXcGX13GWcVi6tg7Acgdw8SHEEvhJzw1U01lvMS-Ptks6eeWj0cDdM_Al9hS5WkUA4ZR7rQK5CU9Uovn9WWF5I-6Ot0zcXZKaJMNIndiPYdIq0rpcpehlB8k&rwebid=8347449&rhost=1
AjaxRender.htm?encparams=10~1438958542856547329~OUrqnIrSPt0QON_7Q12RhcKfwyc22cFvE0xIIobEoUIFu91yWu5SK_jSW59wazXcfcxjpZnQ9YTWAH5kxu8H2B-lu2vO9J47cqg9ThA6AvDFRhj-6moF1_6ymrCKqhbcJdQddN24hShw9IwJOs2uDYJ2bECVJlnoraak4PGtBLHV4TnoVy9eZJxVPNB3XbIumIivk84XZyg=&rwebid=8347449&rhost=1
The issue I am facing is that the string before &rwebid occasionally contains - or = (Base64?), which breaks things.
Updated
https://regex101.com/r/pudx92/2
Why not just stop at the string delimiter "?
AjaxRender[.]htm[?]encparams=[^\"]*
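Applied from Python, the pattern above could look like the sketch below. It assumes the URLs appear inside double-quoted attributes, which is what stopping at " relies on. The html string is shortened, made-up sample data, not the real page.

```python
import re

# Match from the fixed prefix up to (but not including) the closing quote,
# so characters such as '-', '=', '~' and '&' are all kept.
PATTERN = re.compile(r'AjaxRender[.]htm[?]encparams=[^"]*')

html = (
    '<a href="AjaxRender.htm?encparams=2~123~abc-DEF_ghi==&rwebid=8347449&rhost=1">one</a>'
    '<a href="AjaxRender.htm?encparams=7~456~-8dtDO_8-jp&rwebid=8347449&rhost=1">two</a>'
)

urls = PATTERN.findall(html)
for url in urls:
    print(url)
```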

Using tweepy to get unique tweets

I am trying to get a corpus of tweets using a number of search terms. One issue I am having is that I cannot get only unique tweets; that is, the results include retweets.
Is there a way to remove these beforehand without doing any text processing?
What I've got now:
api = tweepy.API(auth)
for search in hashtags:
    for tweet in tweepy.Cursor(api.search, q=search, count=1000, lang="en").items():
        text = repr(tweet.text.encode("utf-8"))
        out.write(text + "\n")
You can add " -filter:retweets" to your query to only get original tweets. Maybe not the prettiest solution, but it works.
api = tweepy.API(auth)
for search in hashtags:
    for tweet in tweepy.Cursor(api.search, q=search + " -filter:retweets", count=1000, lang="en").items():
        text = repr(tweet.text.encode("utf-8"))
        out.write(text + "\n")

Django OperationalError: "Error 'repetition-operator operand invalid' from the regexp"

I have asked this question before and the answer led to this error that I am unable to solve.
I have an Article model that takes in articles from users. These articles can get #hashtags in them like we have in twitter. I have these hashtags converted to links that users can click to load all articles that have the clicked hashtags in them.
I have these articles saved in the Article model:
1. 'For the love of learning: why do we give #Exam?'
2. 'Articles containing #Examination should not come up when exam is clicked'
3. 'This is just an #example post'
I tried using Django's __icontains filter:
def hash_tags(request, hash_tag):
    hash_tag = '#' + hash_tag
    articles = Articles.objects.filter(content__icontains=hash_tag)
    articles = list(articles)
    return HttpResponse(articles)
but if a user clicks on #exam, all three articles are returned instead of only the first one.
I could add a space so that '#exam' becomes '#exam ' and it would work, but I want to be able to do it with a regex.
I have tried:
articles = Articles.objects.filter(content__iregex=r"{0}\b".format(hash_tag))
but I get an empty response.
And this:
articles = Articles.objects.filter(content__iregex=r"(?i){0}\b".format(hash_tag))
returns "Error 'repetition-operator operand invalid' from regexp".
How do I do this correctly? I am using Django 1.6 with MySQL as the backend.
MySQL does not understand Perl-style regular expressions. Read the MySQL regex documentation and use the word-boundary marker [[:>:]] instead of \b. Also note that the string '\b' in Python is a backspace character.
To get a regex \b in Python, you must use a double backslash ('\\b') or a raw string prefixed with r (r'\b').
You should also check that the hashtag contains only safe characters. Otherwise a malicious user could construct a regex that takes forever to evaluate (a DoS attack).
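The points above can be sketched in plain Python. The helper mysql_hashtag_pattern and the whitelist of letters, digits and underscore are assumptions for illustration, not part of Django. The [[:>:]] marker belongs to MySQL's older regex library; as far as I know, MySQL 8.0.4 and later use ICU, where \b is the word boundary instead.

```python
import re

# '\b' in a normal string literal is one character: backspace (0x08).
print('\b' == '\x08', len('\b'))  # True 1
# A raw string or a doubled backslash keeps the two characters \ and b.
print(len(r'\b'), len('\\b'))     # 2 2

# Assumed whitelist: only letters, digits and underscore are allowed.
SAFE_TAG = re.compile(r'[A-Za-z0-9_]+')


def mysql_hashtag_pattern(tag: str) -> str:
    """Build a MySQL-style regex matching '#tag' followed by a word end."""
    if not SAFE_TAG.fullmatch(tag):
        raise ValueError("unsafe characters in hashtag: {!r}".format(tag))
    return '#' + tag + '[[:>:]]'


pattern = mysql_hashtag_pattern('exam')
print(pattern)  # #exam[[:>:]]
```

The resulting string would then be passed to the filter from the question, e.g. Articles.objects.filter(content__iregex=pattern).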
