Retrieving Twitter data on the fly - python

Our company is trying to read in live streams of data entered by random users, e.g. a random user sends off a tweet saying "ABC company".
Since you can use a Twitter client to search for that text, I'm working under the assumption that it's possible to aggregate all tweets containing it without using a client, i.e. stream them in live and write them to a file, without relying on hashtags.
What's the best way to do this? And if you've done this before, could you share your script? I reckon the simplest way would be a Ruby/Python script left running, but my understanding of Ruby/Python is limited at best.
Kindly help?

Here's a bare minimum:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import twitter
from threading import *
from os import _exit, urandom
from time import sleep
from logger import *
import unicodedata

## Based on: https://github.com/sixohsix/twitter
class twitt(Thread):
    def __init__(self, tags = None, *args, **kwargs):
        self.consumer_key = '...'
        self.consumer_secret = '...'
        self.access_key = '...'
        self.access_secret = '...'
        self.encoding = 'iso-8859-15'
        self.args = args
        self.kwargs = kwargs
        self.searchapi = twitter.Twitter(domain="search.twitter.com").search
        self.alive = True
        Thread.__init__(self)
        self.start()

    def search(self, tag):
        try:
            return self.searchapi(q=tag)['results']
        except:
            return {}

    def run(self):
        # keep the thread alive until the caller sets self.alive = False
        while self.alive:
            sleep(3)
To use it, do something like:
if __name__ == "__main__":
    t = twitt()
    print t.search('#DHSupport')
    t.alive = False
Note: The only reason this is threaded is that it's a piece of code I had lying around from other projects; it gives you an idea of how to work with the API and perhaps build a background service to fetch search results on Twitter.
There's a lot of leftover cruft in my original code, so the structure might look a bit odd.
Note that you don't really need the consumer keys etc. for just a search, but you will need an OAuth login for more features such as posting or checking messages.
The only two things you really need are:
import twitter
print twitter.Twitter(domain="search.twitter.com").search(q='#hashtag')['results']
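If you want to capture matching tweets as they arrive instead of polling the search endpoint (which is what the original question is really about), the same sixohsix library also exposes the streaming API. Here is a minimal sketch, assuming you have OAuth credentials; the keys below are placeholders and the exact interface may differ between library versions:
import twitter

# Placeholder OAuth credentials (token, token secret, consumer key, consumer secret)
auth = twitter.OAuth('ACCESS_KEY', 'ACCESS_SECRET', 'CONSUMER_KEY', 'CONSUMER_SECRET')

stream = twitter.TwitterStream(auth=auth)

# Track a plain phrase, no hashtag required; the iterator yields one tweet (a dict) at a time
for tweet in stream.statuses.filter(track='ABC company'):
    print tweet.get('text', '').encode('utf-8')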

Related

Converting Webelement.text to a string

I am trying to use machine learning to perform sentiment analysis on data from twitter. To aggregate the data, I've made a class which will mine and
pre-process data. In order to clean and pre-process the data, I'd like to convert each tweet's text to a string. However, when the line of code in the inner for loop of the massMine method is called, I get a WebDriverException: no such session. The relevant bits of code are below; any input is appreciated, thanks.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import numpy as np
import pandas
import re

class TweetMiner(object):

    def __init__(self):
        self.base_url = u'https://twitter.com/search?q=from%3A'
        self.raw_data = []

    def mineTweets(self, query, tweet_quota):
        '''
        Mine data from a singular twitter account,
        input consists of a twitter handle, and a
        value indicating how much data to mine
        Ex: "#diddy" should be inputted as "diddy"
        '''
        browser = webdriver.Chrome()
        url = self.base_url + query
        browser.get(url)
        time.sleep(1)
        body = browser.find_element_by_tag_name('body')
        for _ in range(tweet_quota):
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)
        tweets = browser.find_elements_by_class_name('tweet-text')
        for tweet in tweets:
            print(tweet.text)
        browser.close()
        return tweets

    def massMine(self, inputArray, dataSize):
        '''
        Mine data from an array of twitter
        accounts, input array consists of twitter
        handles and a value indicating how much
        data to mine
        Ex: "#diddy" should be inputted as "diddy"
        '''
        for user in inputArray:
            rtn = ""
            tweets = self.mineTweets(user, dataSize)
            for tweet in tweets:
                rtn += (tweet.text)
        return rtn
EDIT: I don't know what caused this error, but if anyone stumbles across this post with a similar error, I was able to work around it by simply writing each tweet to a text file.
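For anyone curious, a minimal sketch of that workaround (my own reconstruction, not the original code) is to extract the plain strings inside mineTweets before browser.close() is called, while the session is still alive, and write them to a file:
        # inside mineTweets, before browser.close(): grab plain strings while the session is alive
        tweet_texts = [tweet.text for tweet in tweets]
        with open('tweets.txt', 'a') as f:   # hypothetical output file
            for text in tweet_texts:
                f.write(text + '\n')
        browser.close()
        return tweet_texts                   # return strings, not stale WebElements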
I used to get this error when I had opened too many browser instances and hadn't closed them properly (whether via the automation script or manually). Once the other browser instances were killed one by one, the error went away. I also found that the C:\Users\(yourAccountName)\AppData\Local\Temp directory was completely filled up, which was causing the NoSuchSession error.
The preferred solution is to check whether too many browsers/tabs are open and close them, or to manually remove all the contents of the Temp path above and try again.
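To make sure browser instances always get cleaned up, even when the script raises, one option (a sketch, not the answer author's code) is to wrap the session in try/finally and call quit(), which terminates the driver process rather than just closing the window:
from selenium import webdriver

browser = webdriver.Chrome()
try:
    browser.get('https://twitter.com/search?q=from%3Adiddy')  # example URL
    # ... scrape whatever you need here ...
finally:
    browser.quit()  # ends the session and the chromedriver process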

Optimise python function fetching multi-level json attributes

I have a 3-level JSON file. I am fetching the values of some of the attributes from each of the 3 levels of the JSON. At the moment, the execution time of my code is pathetic, as it takes about 2-3 minutes to get the results onto my web page. I will have a much larger JSON file to deal with in production.
I am new to Python and Flask and haven't done much web programming. Please suggest ways I could optimise the code below! Thanks for the help, much appreciated.
import json
import urllib2
import flask
from flask import request

def Backend():
    url = 'http://localhost:8080/surveillance/api/v1/cameras/'
    response = urllib2.urlopen(url).read()
    response = json.loads(response)
    components = list(response['children'])
    urlComponentChild = []
    for component in components:
        urlComponent = str(url + component + '/')
        responseChild = urllib2.urlopen(urlComponent).read()
        responseChild = json.loads(responseChild)
        camID = str(responseChild['id'])
        camName = str(responseChild['name'])
        compChildren = responseChild['children']
        compChildrenName = list(compChildren)
        for compChild in compChildrenName:
            href = str(compChildren[compChild]['href'])
            ID = str(compChildren[compChild]['id'])
            urlComponentChild.append([href, ID])
    myList = []
    for each in urlComponentChild:
        response = urllib2.urlopen(each[0]).read()
        response = json.loads(response)
        url = each[0] + '/recorder'
        responseRecorder = urllib2.urlopen(url).read()
        responseRecorder = json.loads(responseRecorder)
        username = str(response['subItems']['surveillance:config']['properties']['username'])
        password = str(response['subItems']['surveillance:config']['properties']['password'])
        manufacturer = str(response['properties']['Manufacturer'])
        model = str(response['properties']['Model'])
        status = responseRecorder['recording']
        myList.append([each[1], username, password, manufacturer, model, status])
    return myList

APP = flask.Flask(__name__)

@APP.route('/', methods=['GET', 'POST'])
def index():
    """ Displays the index page accessible at '/'
    """
    if request.method == 'GET':
        return flask.render_template('index.html', response=Backend())

if __name__ == '__main__':
    APP.debug = True
    APP.run(port=62000)
Ok, caching. So what we're going to do is start returning values to the user instantly based on data we already have, rather than generating new data every time. This means that the user might get slightly less up-to-date data than is theoretically possible, but it also means they receive the data they do get as quickly as the system you're using allows.
So we'll keep your backend function as it is. Like I said, you could certainly speed it up with multithreading (if you're still interested in that, the ten-second version is that I would use grequests to asynchronously fetch data from a list of URLs).
But, rather than call it in response to the user every time a user requests data, we'll just call it routinely every once in a while. This is almost certainly something you'd want to do eventually anyway, because it means you don't have to generate brand new data for each user, which is extremely wasteful. We'll just keep some data on hand in a variable, update that variable as often as we can, and return whatever's in that variable every time we get a new request.
from threading import Thread
from time import sleep

data = None

def Backend():
    .....

def main_loop():
    while True:
        sleep(LOOP_DELAY_TIME_SECONDS)
        global data
        data = Backend()

APP = flask.Flask(__name__)

@APP.route('/', methods=['GET', 'POST'])
def index():
    """ Displays the index page accessible at '/'
    """
    if request.method == 'GET':
        # Return whatever data we currently have cached
        return flask.render_template('index.html', response=data)

if __name__ == '__main__':
    data = Backend()  # Need to make sure we grab data before we start the server so we never return None to the user
    Thread(target=main_loop).start()  # Loop and grab new data at every loop
    APP.debug = True
    APP.run(port=62000)
DISCLAIMER: I've used Flask and threading before for a few projects, but I am by no means an expert on them or on web development in general. Test this code before using it for anything important (or better yet, find someone who knows what they're doing before using it for anything important).
Edit: data will have to be a global, sorry about that; hence the disclaimer.
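For the multithreading aside above, a rough sketch of fetching the child URLs concurrently with grequests could look like the following; grequests is assumed to be installed, and the variable names mirror the question's code but are otherwise illustrative:
import grequests

# urls would be the list of child endpoints collected in Backend()
urls = [each[0] for each in urlComponentChild]

# build unsent requests, then fire them off concurrently
pending = (grequests.get(u) for u in urls)
responses = grequests.map(pending)

# parse whatever came back successfully
results = [r.json() for r in responses if r is not None]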

Use the Google Custom Search API to search the web from Python

I'm a newbie in Python, HTML and CSS, and am trying to reverse engineer "https://github.com/scraperwiki/google-search-python" to learn all three and use the Google Custom Search API to search the web from Python. Specifically, I want to search the search engine I made through Google Custom Search, "https://cse.google.com/cse/publicurl?cx=000839040200690289140:u2lurwk5tko". I looked through the code, made some minor adjustments, and came up with the following "Search.py":
import os
from google_search import GoogleCustomSearch

#This is for the traceback
import traceback
import sys

#set variables
os.environ["SEARCH_ENGINE_ID"] = "000839... "
os.environ["GOOGLE_CLOUD_API_KEY"] = "AIza... "

SEARCH_ENGINE_ID = os.environ['SEARCH_ENGINE_ID']
API_KEY = os.environ['GOOGLE_CLOUD_API_KEY']

api = GoogleCustomSearch(SEARCH_ENGINE_ID, API_KEY)

print("we got here\n")

#for result in api.search('prayer', 'https://cse.google.com/cse/publicurl?cx=000839040200690289140:u2lurwk5tko'):
for result in api.search('pdf', 'http://scraperwiki.com'):
    print(result['title'])
    print(result['link'])
    print(result['snippet'])

print traceback.format_exc()
And the import (at least the relevant parts, I believe) comes from the following code in google_search.py:
class GoogleCustomSearch(object):
    def __init__(self, search_engine_id, api_key):
        self.search_engine_id = search_engine_id
        self.api_key = api_key

    def search(self, keyword, site=None, max_results=100):
        assert isinstance(keyword, basestring)
        for start_index in range(1, max_results, 10):  # 10 is max page size
            url = self._make_url(start_index, keyword, site)
            logging.info(url)
            response = requests.get(url)
            if response.status_code == 403:
                LOG.info(response.content)
            response.raise_for_status()
            for search_result in _decode_response(response.content):
                yield search_result
                if 'nextPage' not in search_result['meta']['queries']:
                    print("No more pages...")
                    return
However, when I try to run it, I get the following.
So, here's my problem. I can't quite figure out why the following lines of code don't print to the terminal. What am I overlooking?
print(result['title'])
print(result['link'])
print(result['snippet'])
The only thing I can think of is that I didn't use a correct ID or something. I created a Google Custom Search engine and a project on the Google Developers Console as the quick start suggested. That is where I got my SEARCH_ENGINE_ID and GOOGLE_CLOUD_API_KEY from.
After I added the stack trace suggested in the comments, I got this:
Am I just misunderstanding the code, or is there something else I'm missing? I really appreciate any clues that will help me solve this problem, I'm kind of stumped right now.
Thanks in advance guys!

Python OOP - web session

I have the following script:
import mechanize, cookielib, re ...
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.addheaders = ....
and do stuff
Because my script is growing very big, I want to split it into classes. One class to handle the web connection, one class to do stuff, and so on.
From what I read, I need something like:
from web_session import * # this is my class handling the web connection (cookies + auth)
from do_stuff import * # I do stuff on web pages
and in my main, I have:
browser = Web_session()
stuff = Do_stuff()
The problem for me is that I lose the session cookies when I pass the browser to Do_stuff. Can anyone help me with a basic example of classes and interaction, let's say: I log in on a site, I browse a page, and I want to do something like re.findall("something", on_that_page). Thanks in advance.
Update:
Main Script:
br = WebBrowser()
br.login(myId, myPass)
WebBrowser class:
class WebBrowser():
    def __init__(self):
        self.browser = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        self.browser.set_cookiejar(cj)
        self.browser.addheaders = ....

    def login(self, username, password):
        self.username = username
        self.password = password
        self.browser.open(some site)
        self.browser.submit(username, password)

    def open(self, url):
        self.url = url
        self.browser.open(url)

    def read(self, url):
        self.url = url
        page = self.browser.open(url).read()
        return page
Current state:
This part works perfectly, I can log in, but I lose the mechanize class "goodies" like open, post or read a URL.
For example:
management = br.read("some_url.php")
all my cookies are gone (error: must be logged in)
How can I fix it?
The "mechanise.Browser" class has all the functionality it seens you want to put on your "Web_session" class (side note - naming conventions and readility would recomend "WebSession" instead).
Anyway, you will retain you cookies if you keep the same Browser object across calls - if you really go for having another wrapper class, just create a mehcanize.Broser when instantiating your Web_session class, and keep that as an object attribute (for example, as "self.browser") .
But, you most likelly don't need to do that - just create a Browser on the __init__ of your Do_stuff, keep it as an instance attribute, and reuse it for all requests -
class DoStuff(object):
    def __init__(self):
        self.browser = mechanize.Browser()
        cj = cookielib.LWPCookieJar()
        self.browser.set_cookiejar(cj)

    def login(self, login_url, username, password):
        # open the login page, fill in the first form and submit it
        # (the form field names depend on the site you are logging in to)
        self.browser.open(login_url)
        self.browser.select_form(nr=0)
        self.browser["username"] = username
        self.browser["password"] = password
        self.browser.submit()

    def match_text_at_page(self, url, text):
        # this reuses the same cookie jar as the login request
        page = self.browser.open(url).read()
        return re.findall(text, page)
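A usage sketch under the same assumptions (the URLs and form field names are placeholders):
stuff = DoStuff()
stuff.login("http://example.com/login", "myId", "myPass")
print stuff.match_text_at_page("http://example.com/members", "something")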
Never use the construct from X import * as in
from web_session import *
from do_stuff import *
It's ok when you are experimenting in an interactive session, but don't use it in your code.
Imagine the following: In web_session.py you have a function called my_function, which you use in your main module. In do_stuff.py you have an import statement from some_lib_I_found_on_the_net import *. Everything is nice, but after a while, your program mysteriously fails. It turns out that you upgraded some_lib_I_found_on_the_net.py, and the new version contained a function called my_function. Your main program is suddenly calling some_lib_I_found_on_the_net.my_function instead of web_session.my_function. Python has such nice support for separating concerns, but with that lazy construct, you'll just shoot yourself in the foot, and besides, it's so nice to be able to look in your code and see where every object comes from, which you don't with the *.
If you want to avoid long things like web_session.my_function(), do either import web_session as ws and then ws.my_function(), or from web_session import my_function, ...
Even if you only import one single module in this way, it can bite you. I had colleagues who had something like...
...
import util
...
from matplotlib import *
...
(a few hundred lines of code)
...
x = util.some_function()
...
Suddenly, they got an AttributeError on the call to util.some_function, which had worked like a charm for years. However hard they looked at the code, they couldn't understand what was wrong. It took a long time before someone realized that matplotlib had been upgraded, and now it contained a function called (you guessed it) util!
Explicit is better than implicit!

My Python script doesn't give me an error or shows any output

I'm creating a simple transit Twitter bot which posts a tweet to my API, then grabs the result to later reply with an answer about travel times and such. All the magic is on the server side, and this code should work just fine. Here's how:
A user composes a tweet like the one below:
#kollektiven Sundsvall Navet - Ljustadalen
My script removes the #kollektiven from the tweet and sends the rest, Sundsvall Navet - Ljustadalen, to our API. A JSON response should then be given to the script. The script should later reply to you with an answer like this:
#jackbillstrom Sundsvall busstation Navet (2014-01-08 20:45) till Ljustadalen centrum (Sundsvall kn) (2014-01-08 20:59)
But it doesn't. I'm using this code from GitHub called spritzbot. I edited extensions/hello.py to look like the one below:
# -*- coding: utf-8 -*-
import json, urllib2, os

os.system("clear")

def process_mention(status, settings):
    print status.user.screen_name, ':', status.text.encode('utf-8')
    urlencode = status.text.lower().replace(" ", "%20") # URL-encoding
    tweet = urlencode.strip('#kollektiven ') # note: str.strip() removes any of these characters from both ends, not the literal prefix
    try:
        call = "http://xn--datorkraftfrvrlden-xtb17a.se/kollektiven/proxy.php?input="+tweet # Endpoint
        endpoint = urllib2.urlopen(call) # GET request to the API endpoint
        data = json.load(endpoint) # Load JSON
        answer = data['proxyOutput'] # The answer from the API
        return dict(response=str(answer)) # Posts answer tweet
    except:
        return dict(response="Error, kontakta #jackbillstrom") # Error message
What is causing this problem? And why? I made some changes before I came to this revision, and it worked back then.
You need:
if __name__ == '__main__':
    process_mention(...)
    ...
You're not calling process_mention anywhere, just defining it.
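If you just want to test the idea locally, outside the spritzbot framework, a quick sketch with a stand-in status object (entirely hypothetical; the real framework passes its own status and settings) would be:
class FakeUser(object):
    screen_name = 'jackbillstrom'

class FakeStatus(object):
    user = FakeUser()
    text = u'#kollektiven Sundsvall Navet - Ljustadalen'

if __name__ == '__main__':
    # calls the extension's handler directly with the fake status
    print process_mention(FakeStatus(), settings=None)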
