Tweet scraping while using Twint - python

I am doing some research on sentiment analysis of tweets. I have been using Twint to scrape tweets for selected cities, and I was getting more tweets that way than when I scraped tweets for the whole world with the same hashtag over a five-year period from 2010 to 2015. I cannot understand why Twint behaves like that. Here is my code:
import twint
import pandas as pd
import nest_asyncio
nest_asyncio.apply()
cities=['Hyderabad','Mumbai','Kolkata','Vijayawada', 'Warangal', 'Visakhapatnam']
unique_cities=set(cities) #To get unique cities of country
cities = sorted(unique_cities) #Sort & convert datatype to list
for city in cities:
    print(city)
    config = twint.Config()
    config.Search = "#MarutiSuzuki"
    config.Lang = "en"
    config.Near = city
    config.Limit = 1000000
    config.Since = "2010-01-01"
    config.Until = "2015-12-01"
    config.Store_csv = True
    config.Output = "my_finding.csv"
    twint.run.Search(config)

Maybe Twitter has a limit on the number of tweets it returns for a global search: it only shows X entries, but when you narrow the search down by location it returns up to that maximum for each area. For instance, Amazon only shows about 400 pages of results for a broad search even though there may be more items; likewise, if you make the search more specific, it may show items the broader search did not.
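If you do keep the per-city searches, note that nearby cities can return overlapping results, and the code above appends every city to the same CSV. A minimal deduplication sketch, assuming Twint's CSV output includes an id column (file name as in the question):
import pandas as pd

# Load the combined per-city output written by Twint.
df = pd.read_csv("my_finding.csv")

# Drop tweets that were captured for more than one nearby city (assumes an "id" column).
df = df.drop_duplicates(subset="id")

df.to_csv("my_finding_dedup.csv", index=False)
print(len(df), "unique tweets")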

Related

Only scrape google news articles containing an exact phrase in python

I'm trying to build a media tracker in Python that returns, each day, all Google News articles containing a specific phrase, "Center for Community Alternatives". If, on a given day, there are no new articles that contain this exact phrase, then no new links should be added to the data frame. The problem I am having is that even on days when there are no news articles containing my phrase, my code adds articles with similar phrases to the data frame. How can I only append links that contain my exact phrase?
Below is example code looking at 03/01/22:
from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd
googlenews=GoogleNews(start='03/01/2022',end='03/01/2022')
googlenews.search('"' + "Center for Community Alternatives" + '"')
googlenews.getpage(1)
result=googlenews.result()
df=pd.DataFrame(result)
df
Even though searching "Center for Community Alternatives" (with quotes around it) in Google News for this specific date gives No results found for "center for community alternatives", the code scrapes the links that appear below that message, which are the Results for center for community alternatives (without quotes).
The API you're using does not support exact match.
In https://github.com/Iceloof/GoogleNews/blob/master/GoogleNews/__init__.py:
def search(self, key):
    """
    Searches for a term in google.com in the news section and retrieves the first page into __results.
    Parameters:
    key = the search term
    """
    self.__key = "+".join(key.split(" "))
    if self.__encode != "":
        self.__key = urllib.request.quote(self.__key.encode(self.__encode))
    self.get_page()
As an alternative, you could probably just filter your data frame using an exact match:
df = df[df['title or some column'].str.contains('Center for Community Alternatives', case=False, regex=False, na=False)]
You're probably not getting any results because:
There are no search results that match your search term - "Center for Community Alternatives" - within the date range given in your question - 03/01/2022.
If you consider changing the search term by removing the double quotes AND increasing the date range, you might get some results - that will depend entirely on how actively the source posts news and how Google handles such topics.
What I suggest is to change your code to:
Keep the search term - Center for Community Alternatives - without double quotes
Apply a longer date range to the search
Get only distinct values - while testing this code, I got duplicate entries.
Get more than one page to increase the chances of getting results.
Code:
#!pip install GoogleNews # https://pypi.org/project/GoogleNews/
#!pip install newspaper3k # https://pypi.org/project/newspaper3k/
from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd
search_term = "Center for Community Alternatives"
googlenews = GoogleNews(lang='en', region='US', start='03/01/2022', end='03/03/2022') # I suppose the date is in "MM/dd/yyyy" format...
googlenews.search(search_term)
# Initial list of results - it will contain a list of dictionaries (dict).
results = []
# Contains the final results = news filtered by the criteria
# (news that in their description contains the search term).
final_results = []
# Get the first few pages of results and append them to the list - you can set any other range according to your needs:
for page in range(1, 4):
    googlenews.getpage(page) # Consider adding a timer to avoid making calls too quickly and getting an "HTTP Error 429: Too Many Requests" error.
    results.extend(googlenews.result())
# Remove duplicates and include in the "final_results" list
# only the news whose description contains the search term:
for item in results:
    if (item not in final_results and (search_term in item["desc"])):
        final_results.append(item)
# Build and show the final dataframe:
df = pd.DataFrame(final_results)
df
Keep in mind that you may still get no results at all, for reasons outside of your control.

How to get the number of tweets from a hashtag in python?

I am trying to get the number of tweets containing a hashtag (let's say "#kitten") in python.
I am using tweepy.
However, all the code examples I have found are of this form:
query = "kitten"
for i, status in enumerate(tweepy.Cursor(api.search, q=query).items(50)):
    print(i, status)
I get this error: 'API' object has no attribute 'search'
Tweepy does not seem to contain this method anymore. Is there any way to solve my problem?
Sorry for my bad English.
After browsing the web and the Twitter documentation, I found the answer.
If you want the historical count of all tweets going back to 2006, you need Academic Research access. That is not my case, so I can only get a 7-day window, which is enough for me. Here is the code:
import tweepy
query = "kitten -is:retweet"
client = tweepy.Client(bearer_token)
counts = client.get_recent_tweets_count(query=query, granularity='day')
for i in counts.data:
    print(i["tweet_count"])
The "-is:retweet" is here to not count the retweets. You need to remove it if you want to count them.
Since we're not pulling any tweets (only the volume of them) we are not increasing our MONTHLY TWEET CAP USAGE.
Be carefull when using symbols in your query such as "$" it might give you an error. For a list of valid operators see : list of valid operators for query
As said here Twitter counts introduction, you only need "read only" authorization to perform a recent count request. (see Recent Tweet counts)
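If all you need is one total for the whole 7-day window rather than per-day buckets, you can simply sum them; a small sketch on top of the snippet above (same client and query as before):
# counts.data holds one dict per "day" bucket of the last 7 days.
total = sum(bucket["tweet_count"] for bucket in counts.data)
print("Tweets matching the query in the last 7 days:", total)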

Tweepy get Tweets related to a specific country

Context
I am working on a topic modeling for twitter project.
The idea is to retrieve all tweets related to a specific country and analyze them in order to discover what people from a specific country are talking about on Twitter.
What I have tried
1.First Solution
I know that we can use the Twitter streaming API or a cursor to retrieve tweets from a specific country, and I have tried the following code to get all tweets given the geocode coordinates of a country.
I have written the following code :
def get_tweets(query_fname, auth, max_time, location=None):
    stop = datetime.now() + max_time
    twitter_stream = Stream(auth, CustomListener(query_fname))
    while datetime.now() < stop:
        if location:
            twitter_stream.filter(locations=[11.94,-13.64,30.54,5.19], is_async=True)
        else:
            twitter_stream.filter(track=query, is_async=True)
The problem of this approach
Not everyone has allowed Twitter to access their location details, and with this approach I can only get a few tweets, something like 300, for my location.
There are some people who are not in the country but who tweet about it, and people within the country reply to them. Their tweets are not captured by this approach.
2.Second Solution
Another approach was to collect tweets with hashtags related to a country with a cursor
I have tried this code :
def query_tweet(client, query=[], max_tweets=2000, country=None):
    """
    Query tweets using the query list passed in as a parameter.
    """
    query = ' OR '.join(query)
    name = 'by_hashtags_'
    now = datetime.now()
    today = now.strftime("%d-%m-%Y-%H-%M")
    with open('data/query_drc_{}_{}.jsonl'.format(name, today), 'w') as f:
        for status in Cursor(
                client.search,
                q=query,
                include_rts=True).items(max_tweets):
            f.write(json.dumps(status._json) + "\n")
Problem
This approach gives more results than the first one, but as you may notice, not everyone uses those hashtags to tweet about the country.
3.Third approach
I have tried to retrieve tweets using a place id specific to a country, but it has the same problem as the first approach.
My questions
How can I retrieve all tweets about a specific country? I mean everything people are tweeting about a specific country, with or without country-specific hashtags.
Hint: for people who are not located in the country, it may be a good idea to get their tweets if they were replied to or retweeted by people within the country.
Regards.

Discogs API => How to retrieve genre?

I've crawled a tracklist of 36,000 songs that have been played on the Danish national radio station P3. I want to do some statistics on how frequently each genre has been played within this period, so I figured the Discogs API might help label each track with a genre. However, the documentation for the API doesn't seem to include an example for querying the genre of a particular song.
I have a CSV file with 3 columns: Artist, Title & Test (Test is where I want the API to put the genre of each song).
Here's a sample of the script I've built so far:
import json
import pandas as pd
import requests
import discogs_client
d = discogs_client.Client('ExampleApplication/0.1')
d.set_consumer_key('key-here', 'secret-here')
input = pd.read_csv('Desktop/TEST.csv', encoding='utf-8',error_bad_lines=False)
df = input[['Artist', 'Title', 'Test']]
df.columns = ['Artist', 'Title','Test']
for i in range(0, len(list(df.Artist))):
    x = df.Artist[i]
    g = d.artist(x)
    df.Test[i] = str(g)
df.to_csv('Desktop/TEST2.csv', encoding='utf-8', index=False)
This script has been working so far with a dummy file containing 3 records, mapping the artist for a given ID#. But as soon as the file gets larger (e.g. 2000 rows), it returns an HTTPError when it cannot find an artist.
I have some questions regarding this approach:
1) Would you recommend using the search query function in the API to retrieve a variable such as 'Genre'? Or do you think it is possible to retrieve the genre with a 'd.' function from the API?
2) Will I need to acquire an API key? I have successfully mapped the 3 records without an API key so far. It looks like the key is free, though.
Here's the guide I have been following:
https://github.com/discogs/discogs_client
And here's the documentation for the API:
https://www.discogs.com/developers/#page:home,header:home-quickstart
Maybe you need to re-read the discogs_client examples; I am not an expert myself, but a newbie trying to use this API.
AFAIK, g = d.artist(x) fails because x must be an integer, not a string.
So you must first do a search, then get the artist id, and then call d.artist(artist_id).
Sorry for not providing an example, I am a Python newbie right now ;)
Also, have you checked acoustid?
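A rough sketch of that search-then-lookup flow, assuming the d client from the question and an arbitrary artist name as a placeholder (attribute names follow the discogs_client README, so treat them as an assumption):
# 1) search by name, 2) take the first hit's numeric id, 3) fetch the artist by id
results = d.search("Aphex Twin", type="artist")
try:
    artist_id = results[0].id   # indexing an empty result set raises IndexError
except IndexError:
    artist_id = None

if artist_id is not None:
    artist = d.artist(artist_id)
    print(artist.name)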
It's probably a rate limit.
Read the status code of your response; you should find a 429 Too Many Requests.
Unfortunately, if that's the case, the only solution is to add a sleep to your code so you make one request per second.
Check out the API doc:
http://www.discogs.com/developers/#page:home,header:home-rate-limiting
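A hedged sketch of that pacing applied to the loop from the question (the one-request-per-second figure comes from the rate-limit docs above; the broad exception handling is just an assumption to keep the loop alive on missing artists or 429 responses):
import time

for i in range(len(df)):
    try:
        # per-row lookup against the Discogs API, as in the question
        df.loc[i, "Test"] = str(d.artist(df.Artist[i]))
    except Exception:
        # covers HTTP errors such as 429 Too Many Requests or artists not found
        pass
    time.sleep(1)  # roughly one request per second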
I found this guide:
https://github.com/neutralino1/discogs_client.
Access the API with your key and try something like:
d = discogs_client.Client('something.py', user_token=auth_token)
release = d.release(774004)
genre = release.genres
If you find a better solution, please share.
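Since the goal is a genre per song rather than per known release id, here is a hedged sketch of looking a track up by title and artist first (the artist search keyword and the placeholder title/artist/token are assumptions; release.genres is as in the snippet above):
import discogs_client

d = discogs_client.Client('ExampleApplication/0.1', user_token='YOUR_TOKEN')  # placeholder token

# Search for a release matching the track, then read its genre list.
results = d.search('Smells Like Teen Spirit', artist='Nirvana', type='release')
try:
    release = results[0]
    print(release.genres)   # e.g. ['Rock']
except IndexError:
    print("No matching release found")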

Crawl tweets using python (date range)

I'm trying to crawl tweets for my thesis. I'm using Pattern (http://www.clips.ua.ac.be/pages/pattern-web) to crawl (and for sentiment analysis), which requires a Python (2.7) program to run.
So far I've been able to come up with the program below. It works, but only for collecting the X most recent tweets.
My question is: could you help me make it so that I can crawl tweets within a certain date range (for example: Jan 1 2014 - Mar 31 2014) for a specific username?
(Or, if that is not possible, increase the number of tweets crawled at the moment; using the same program for different usernames, each of which has thousands of tweets, I get results anywhere between 40 and 400.)
Thank you very much in advance!
(PS: If none of the above is possible, I'm more than happy to hear alternatives for collecting the necessary tweets. I should add that I don't have a very strong background in programming.)
import os, sys; sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
import time
from pattern.web import Twitter, hashtags
from pattern.db import Datasheet, pprint, pd
from pattern.en import sentiment
try:
    table = Datasheet.load(pd("test.csv"))
    index = set(table.columns[0])
except:
    table = Datasheet()
    index = set()
engine = Twitter(language="en")
prev = None
for i in range(1000):
    print i
    for tweet in engine.search("from:username", start=prev, cached=False):
        if len(table) == 0 or tweet.id not in index:
            table.append([tweet.id, tweet.date, sentiment(tweet.text.encode("iso-8859-15", "replace"))])
            index.add(tweet.id)
        prev = tweet.id
    # sleep time to avoid search limit error (180 requests per 15min window)
    time.sleep(5.1)
    table.save(pd("test.csv"))
print "Total results:", len(table)
print
Crawling tweets isn't a great approach, but it would work as long as Twitter doesn't block your scraper. I'd recommend the Twitter API (both the streaming and search APIs). They let you grab tweets, store them in a database, and do whatever analysis you want.
Instead of crawling, I would recommend you use Twitter's Streaming API. With it you will get more tweets than by crawling - almost all of the tweets, as long as you don't exceed the rate limit of 1% of the firehose. Filters are also provided, which you can use.
Python modules for Twitter's streaming API include:
Twython
twitter
tweepy
etc...
I use Twython. It is good. Hope this helps.
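If you specifically need the Jan-Mar 2014 window for one username, the current Twitter API exposes that through full-archive search, which in recent Tweepy versions is Client.search_all_tweets and requires Academic Research access; a hedged sketch (the bearer token and username are placeholders):
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder token

# Full-archive search; requires Academic Research access.
for page in tweepy.Paginator(
    client.search_all_tweets,
    query="from:username",               # placeholder username
    start_time="2014-01-01T00:00:00Z",
    end_time="2014-03-31T23:59:59Z",
    tweet_fields=["created_at"],
    max_results=100,
):
    for tweet in page.data or []:
        print(tweet.id, tweet.created_at, tweet.text)
The full-archive endpoint is rate-limited, so for large pulls add a short sleep between pages.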
