Crawl tweets using python (date range) - python

I'm trying to crawl tweets for my thesis. I'm using Pattern (http://www.clips.ua.ac.be/pages/pattern-web) for crawling and for sentiment analysis, which requires Python 2.7.
So far I've been able to come up with the program below. It works, but only for collecting the X most recent tweets.
My question is: could you help me make it so that I can crawl tweets within a certain date range (for example, Jan 1 2014 - Mar 31 2014) for a specific username?
(Or, if that isn't possible, increase the number of tweets crawled at the moment; using the same program for different usernames, each of which has thousands of tweets, I get results anywhere between 40 and 400.)
Thank you very much in advance!
(PS: If none of the above is possible, I'm more than happy to hear alternatives for collecting the necessary tweets. I should add that I don't have a very strong background in programming.)
import os, sys; sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", ".."))
import time
from pattern.web import Twitter, hashtags
from pattern.db import Datasheet, pprint, pd
from pattern.en import sentiment

# Load a previously saved table so repeated runs only append unseen tweets.
try:
    table = Datasheet.load(pd("test.csv"))
    index = set(table.columns[0])
except:
    table = Datasheet()
    index = set()

engine = Twitter(language="en")

prev = None
for i in range(1000):
    print i
    # 'prev' holds the id of the last tweet seen, so each iteration continues with older tweets.
    for tweet in engine.search("from:username", start=prev, cached=False):
        if len(table) == 0 or tweet.id not in index:
            table.append([tweet.id, tweet.date, sentiment(tweet.text.encode("iso-8859-15", "replace"))])
            index.add(tweet.id)
        prev = tweet.id
    # sleep time to avoid search limit error (180 requests per 15min window)
    time.sleep(5.1)

table.save(pd("test.csv"))

print "Total results:", len(table)
print

Crawling tweets isn't a great approach, but it would work as long as Twitter doesn't block your scraper. I'd recommend the Twitter API instead (both the Streaming and Search APIs): it lets you grab tweets, store them in a database, and run whatever analysis you want.
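For the date-range part of the question specifically, the Search API accepts since: and until: operators inside the query string, but the standard search index only reaches back roughly a week, so it cannot retrieve tweets from early 2014. A minimal, untested sketch using an older tweepy (where api.search still exists; newer versions renamed it to api.search_tweets), with placeholder credentials and dates:
import tweepy

# Placeholder credentials; substitute your own keys.
auth = tweepy.OAuthHandler("consumer_key", "consumer_secret")
auth.set_access_token("access_token", "access_token_secret")
api = tweepy.API(auth)

# since:/until: narrow the date range, but standard search only covers the last ~7 days.
query = "from:username since:2014-01-01 until:2014-03-31"
for tweet in tweepy.Cursor(api.search, q=query).items():
    print(tweet.created_at, tweet.text)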

Instead of crawling, I would recommend using Twitter's Streaming API. That way you will get more tweets than by crawling - nearly all of them, as long as you stay under the rate limit of roughly 1% of the firehose. Filters are also provided which you can use.
Python modules for Twitter's Streaming API include:
Twython
twitter
tweepy
etc.
I use Twython. It is good. Hope this helps.
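For what it's worth, a minimal Twython streaming sketch, assuming API v1.1 streaming access and placeholder credentials (the 'python' track term is only an example):
from twython import TwythonStreamer

class TweetCollector(TwythonStreamer):
    def on_success(self, data):
        # each `data` dict is one decoded tweet from the stream
        if 'text' in data:
            print(data['text'])

    def on_error(self, status_code, data, **kwargs):
        # stop on errors such as 401 (bad credentials) or 420 (rate limited)
        print(status_code)
        self.disconnect()

stream = TweetCollector(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
stream.statuses.filter(track='python')   # or stream.statuses.sample() for the ~1% sample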

Related

How to get the number of tweets from a hashtag in Python?

I am trying to get the number of tweets containing a hashtag (let's say "#kitten") in Python.
I am using tweepy.
However, all the code I have found is of this form:
query = "kitten"
for i, status in enumerate(tweepy.Cursor(api.search, q=query).items(50)):
    print(i, status)
I get this error: 'API' object has no attribute 'search'
Tweepy does not seem to contain this attribute anymore. Is there any way to solve my problem?
Sorry for my bad English.
After browsing the web and the Twitter documentation, I found the answer.
If you want the full history of tweet counts back to 2006, you need Academic Research authorization. That is not my case, so I can only track the last 7 days, which is enough for me. Here is the code:
import tweepy

query = "kitten -is:retweet"
client = tweepy.Client(bearer_token)
counts = client.get_recent_tweets_count(query=query, granularity='day')

for i in counts.data:
    print(i["tweet_count"])
The "-is:retweet" is here to not count the retweets. You need to remove it if you want to count them.
Since we're not pulling any tweets (only the volume of them) we are not increasing our MONTHLY TWEET CAP USAGE.
Be carefull when using symbols in your query such as "$" it might give you an error. For a list of valid operators see : list of valid operators for query
As said here Twitter counts introduction, you only need "read only" authorization to perform a recent count request. (see Recent Tweet counts)
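If all you need is one total over the window, the daily buckets can simply be summed; a short sketch assuming the `counts` response from the snippet above:
total = sum(day["tweet_count"] for day in counts.data)
print(total)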

How to get many returned tweets from Twitter search API

I'm looking into the Twitter Search API, and apparently it has a count parameter that determines "The number of tweets to return per page, up to a maximum of 100." What does "per page" mean if I'm, for example, running a Python script like this:
import twitter  # python-twitter package

api = twitter.Api(consumer_key="mykey",
                  consumer_secret="mysecret",
                  access_token_key="myaccess",
                  access_token_secret="myaccesssecret")

results = api.GetSearch(raw_query="q=%23myHashtag&geocode=59.347937,18.072433,5km")
print(len(results))
This will only give me 15 tweets in results. I want more, preferably all tweets, if possible. So what should I do? Is there a "next page" option? Can't I just specify the search query in a way that gives me all tweets at once? Or if the number of tweets is too large, some maximum number of tweets?
Tweepy has a Cursor object that works like this:
for tweet in tweepy.Cursor(api.search, q="#myHashtag", geocode="59.347937,18.072433,5km",
                           lang='en', tweet_mode='extended').items():
    # handle each tweet here, e.g. tweet.full_text
    pass
You can find more info in the Tweepy Cursor docs.
With TwitterAPI you would access pages this way:
pager = TwitterPager(api,
                     'search/tweets',
                     {'q': '#myHashtag', 'geocode': '59.347937,18.072433,5km'})

for item in pager.get_iterator():
    print(item['text'] if 'text' in item else item)
A complete example is here: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py

Getting more than 100 days of data web scraping Yahoo

Like many others I have been looking for an alternative source of stock prices now that the Yahoo and Google APIs are defunct. I decided to take a try at web scraping the Yahoo site from which historical prices are still available. I managed to put together the following code which almost does what I need:
import urllib.request as web
import bs4 as bs
import pandas as pd  # needed for the DataFrame below

def yahooPrice(tkr):
    tkr = tkr.upper()
    url = 'https://finance.yahoo.com/quote/' + tkr + '/history?p=' + tkr
    sauce = web.urlopen(url)
    soup = bs.BeautifulSoup(sauce, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    allrows = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        if len(row) == 7:  # keep only rows with the full seven price columns
            allrows.append(row)
    vixdf = pd.DataFrame(allrows).iloc[0:-1]
    vixdf.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Aclose', 'Volume']
    vixdf.set_index('Date', inplace=True)
    return vixdf
which produces a dataframe with the information I want. Unfortunately, even though the actual web page shows a full year's worth of prices, my routine only returns 100 records (including dividend records). Any idea how I can get more?
The Yahoo Finance API was deprecated in May 2017, I believe. There aren't many options left for downloading time series data for free, at least that I know of. Nevertheless, there is always some kind of alternative. Check out the URL below to find a tool to download historical prices.
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
See this too:
https://blog.quandl.com/api-for-stock-data
I don't have an exact solution to your question, but I have a workaround (I had the same problem and hence used this approach). Basically, you can use the Bday() method - 'import pandas.tseries.offsets' - and ask for x business days at a time when collecting the data. In my case, I ran the loop three times to get 300 business days of data, knowing that 100 was the maximum I was getting by default.
Basically, you run the loop three times and set the Bday() offset so that the first iteration grabs the 100 business days up to now, the next the 100 days before that (200 days back), and the last the 100 days before that (300 days back); a minimal sketch of this windowing follows the snippet below. The whole point is that, at any given time, one can only scrape 100 days of data. So even if you loop through 300 days in one go, you may not get 300 days of data - your original problem (possibly Yahoo limits the amount of data extracted in one go). I have my code here: https://github.com/ee07kkr/stock_forex_analysis/tree/dataGathering
Note, the CSV files for some reason are not working with the \t delimiter in my case... but basically you can use the data frame. One more issue I currently have is that 'Volume' is a string instead of a float; the way to get around it is:
apple = pd.DataFrame.from_csv('AAPL.csv', sep='\t')
apple['Volume'] = apple['Volume'].str.replace(',', '').astype(float)
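As an illustration of the windowing idea (not a full scraper), here is a minimal sketch that only computes the three 100-business-day windows with BDay; fetching each window is assumed to be done by your own yahooPrice-style routine:
from datetime import datetime
from pandas.tseries.offsets import BDay

end = datetime.today()
windows = []
for _ in range(3):              # three passes of ~100 business days each
    start = end - BDay(100)     # step back 100 business days
    windows.append((start, end))
    end = start                 # the next pass continues where this one began

for start, end in windows:
    print(start.date(), "->", end.date())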
First - run the code below to get your 100 days.
Then - use SQL to insert the data into a small database (sqlite3 is pretty easy to use with Python); a sketch of that step follows the code below.
Finally - amend the code below to fetch daily prices, which you can add to grow your database.
from pandas import DataFrame
import bs4
import requests

def function():
    url = 'https://uk.finance.yahoo.com/quote/VOD.L/history?p=VOD.L'
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    headers = soup.find_all('th')
    rows = soup.find_all('tr')
    # one list of cell texts per table row
    ts = [[td.getText() for td in rows[i].find_all('td')] for i in range(len(rows))]
    # keep only the date cell of rows that have the full seven price columns
    data = [i[:-6] for i in ts if len(i) == 7]
    days = 100
    num = -1
    while days > 0 and -num <= len(data):
        now = data[num]
        now = DataFrame(now)
        now = str(now[0][0])
        print(now)
        num = num - 1
        days = days - 1

function()
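As a hedged sketch of the sqlite3 step mentioned above (not a tested recipe): it assumes a DataFrame of prices such as the one returned by yahooPrice() in the question, and a database file name of your choosing.
import sqlite3

df = yahooPrice('VOD.L')                        # the question's routine; any ticker works
conn = sqlite3.connect('prices.db')
df.to_sql('prices', conn, if_exists='append')   # appends new rows on each daily run
conn.close()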

Twitter timeline in Python, but only getting 20ish results?

I'm a newbie when it comes to Python. I literally just started today and have little understanding of programming. I have managed to make the following code work:
from twitter import *

config = {}
execfile("config.py", config)

twitter = Twitter(
    auth = OAuth(config["access_key"], config["access_secret"],
                 config["consumer_key"], config["consumer_secret"]))

user = "skiftetse"
results = twitter.statuses.user_timeline(screen_name = user)

for status in results:
    print "(%s) %s" % (status["created_at"], status["text"].encode("ascii", "ignore"))
The problem is that it's only printing 20 results. The Twitter page I'd like to get data from has 22k posts, so something is wrong with the last lines of code.
I would really appreciate help with this! I'm doing this for research on sentiment analysis, so I need several hundred tweets to analyze. Beyond that, it'd be great if retweets, and information about how many people retweeted each post, were included. I need to get better at all this, but right now I just need to meet my deadline at the end of the month.
You need to understand how the Twitter API works. Specifically, the user_timeline documentation.
By default, a request will only return 20 Tweets. If you want more, you will need to set the count parameter to, say, 50.
e.g.
results = twitter.statuses.user_timeline(screen_name = user, count = 50)
Note, count:
Specifies the number of tweets to try and retrieve, up to a maximum of 200 per distinct request.
In addition, the API will only let you retrieve the most recent 3,200 Tweets.
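To actually walk back through those 3,200 Tweets, you can page with max_id; a rough sketch using the same `twitter` package and the `twitter`/`user` objects set up in the question (untested, names assumed):
all_statuses = []
max_id = None

while True:
    kwargs = {"screen_name": user, "count": 200}
    if max_id is not None:
        kwargs["max_id"] = max_id
    batch = twitter.statuses.user_timeline(**kwargs)
    if not batch:
        break
    all_statuses.extend(batch)
    # the next page starts just below the oldest id seen so far
    max_id = batch[-1]["id"] - 1

print("Collected %d statuses" % len(all_statuses))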

Discogs API => How to retrieve genre?

I've crawled a tracklist of 36,000 songs which have been played on the Danish national radio station P3. I want to do some statistics on how frequently each genre has been played within this period, so I figured the Discogs API might help with labeling each track with a genre. However, the documentation for the API doesn't seem to include an example for querying the genre of a particular song.
I have a CSV file with 3 columns: Artist, Title & Test (Test is where I want the API to put each song's genre).
Here's a sample of the script I've built so far:
import json
import pandas as pd
import requests
import discogs_client

d = discogs_client.Client('ExampleApplication/0.1')
d.set_consumer_key('key-here', 'secret-here')

input = pd.read_csv('Desktop/TEST.csv', encoding='utf-8', error_bad_lines=False)
df = input[['Artist', 'Title', 'Test']]
df.columns = ['Artist', 'Title', 'Test']

for i in range(0, len(list(df.Artist))):
    x = df.Artist[i]
    g = d.artist(x)
    df.Test[i] = str(g)

df.to_csv('Desktop/TEST2.csv', encoding='utf-8', index=False)
This script has worked so far on a dummy file with 3 records, mapping the artist of a given ID. But as soon as the file gets larger (e.g. 2000 rows), it returns an HTTPError when it cannot find an artist.
I have some questions regarding this approach:
1) Would you recommend using the search query function in the API for retrieving a variable such as 'Genre', or do you think it is possible to retrieve the genre with a 'd.' function from the API?
2) Will I need to acquire an API key? I have successfully mapped the 3 records without an API key so far. The key looks free, though.
Here's the guide I have been following:
https://github.com/discogs/discogs_client
And here's the documentation for the API:
https://www.discogs.com/developers/#page:home,header:home-quickstart
Maybe you need to re-read the discogs_client examples. I am not an expert myself, just a newbie trying to use this API.
AFAIK, g = d.artist(x) fails because x must be an integer, not a string.
So you must first do a search, then get the artist id, and only then call d.artist(artist_id); a rough sketch is below.
Sorry for not providing a full example, I am a Python newbie right now ;)
Also, have you checked AcoustID?
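A rough, untested sketch of that flow, assuming the `d` client from the question and that the query ('Nirvana' is only an example) returns at least one match:
results = d.search('Nirvana', type='artist')
first = results[0]              # first matching artist from the search
artist = d.artist(first.id)     # d.artist() wants the integer id, not a name
print(artist.name)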
It's probably a rate limit.
Read the status code of your response; you should find a 429 Too Many Requests.
Unfortunately, if that's the case, the only solution is to add a sleep to your code so that you make one request per second.
Check out the API doc:
http://www.discogs.com/developers/#page:home,header:home-rate-limiting
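A minimal sketch of what that looks like, reusing the loop from the question (with `d` and `df` assumed to be set up as above):
import time

for i in range(len(df.Artist)):
    df.Test[i] = str(d.artist(df.Artist[i]))
    time.sleep(1.0)         # stay at roughly one request per second to avoid 429s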
I found this guide:
https://github.com/neutralino1/discogs_client.
Access the API with your key and try something like:
d = discogs_client.Client('something.py', user_token=auth_token)
release = d.release(774004)
genre = release.genres
If you found a better solution please share.
