Stream Tweets asynchronously with Tweepy AsyncStream

Stream Tweets asynchronously with Tweepy AsyncStream - python

I'm trying to run AsyncStream Tweepy, but I ran into a problem
My code
from __future__ import absolute_import, print_function
from tweepy.streaming import Stream
from tweepy import OAuthHandler
from tweepy import Stream
from pprint import pprint
from tweepy.asynchronous import AsyncStream
import asyncio
async def main():
stream = StdOutListener(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
await stream.filter(follow=['1082189695252074496'])
await asyncio.sleep(1.5)
class StdOutListener(AsyncStream):
async def on_status(self, status):
print(status_json)
async def on_error(self, status):
print(status)
if __name__ == '__main__':
asyncio.run(main())
When I run it in .py file, it doesn't work and returns the error "An HTTP: 420 error occurred in the stream".
I also run the code in Jupyter Notebook, only instead of async io.run(main ()), I write await main(), it also returns this error, BUT the stream works and it returns a response.
Why does it work in Jupyter Notebook, but does not work in the .py file. How can this be fixed?

According to the Tweepy documentation section on Handling Errors:
If clients exceed a limited number of attempts to connect to the streaming API in a window of time, they will receive error 420. The amount of time a client has to wait after receiving error 420 will increase exponentially each time they make a failed attempt.
Tweepy’s Stream Listener passes error codes to an on_error stub. The default implementation returns False for all codes, but we can override it to allow Tweepy to reconnect for some or all codes...
Here's another reference to the Twitter API documentation section on HTTP Error Codes.
The first isssue is that print(status_json) should be print(status._json).
The second issue is that the on_status method needs to conditionally return True or False based on status._json, like so:
async def on_status(self, status):
if hasattr(status, "_json"):
print(status._json)
# returning non-False continues the stream
return True
else:
# returning False disconnects the stream
return False
The third issue is that the on_error method needs to conditionally return True or False based on the value of status, like so:
async def on_error(self, status):
if status == 420:
# returning False disconnects the stream
return False
else:
# returning non-False continues the stream
return True

Related

How to automatically generate partition keys for messages (Kafka + Python)?

I'm trying to generate keys for every message in Kafka, for that purpose I want to create a key generator that joins the topic first two characters and the tweet id.
Here is an example of the messages that get sent in kafka:
{"data":{"created_at":"2022-03-18T09:51:12.000Z","id":"1504757303811231755","text":"#Danielog111 #POTUS #NATO #UNPeacekeeping #UN Yes! Not to minimize Ukraine at all, but to bring attention to a horrific crisis and Tigrayan genocide that targets 7M people, longer time frame, and is largely unacknowledged by western news agencies. And people are being eaten-literally! #maddow #JoyAnnReid help Ethiopians!"},"matching_rules":[{"id":"1502932028618072070","tag":"NATO"},{"id":"1502932021731115013","tag":"Biden"}]}'
And here is my code modified to try generating partition keys (I'm using PyKafka):
from dotenv import load_dotenv
import os
import json
import tweepy
from pykafka import KafkaClient
# Getting credentials:
BEARER_TOKEN=os.getenv("BEARER_TOKEN")
# Setting up pykafka:
def get_kafka_client():
return KafkaClient(hosts='localhost:9092,localhost:9093,localhost:9094')
def send_message(data, name_topic, id):
client = get_kafka_client()
topic = client.topics[name_topic]
producer = topic.get_sync_producer()
producer.produce(data, partition_key=f"{name_topic[:2]}{id}")
# Creating a Twitter stream listener:
class Listener(tweepy.StreamingClient):
def on_data(self, data):
print(data)
message = json.loads(data)
for rule in message['matching_rules']:
send_message(data, rule['tag'], message['data']['id'].encode())
return True
def on_error(self, status):
print(status)
# Start streaming:
Listener(BEARER_TOKEN).filter(tweet_fields=['created_at'])
And this is the error I'm getting:
File "/Users/mac/.local/share/virtualenvs/tweepy_step-Ck3DvAWI/lib/python3.9/site-packages/pykafka/producer.py", line 372, in produce
raise TypeError("Producer.produce accepts a bytes object as partition_key, "
TypeError: ("Producer.produce accepts a bytes object as partition_key, but it got '%s'", <class 'str'>)
I've also tried not encoding it and trying to fetch the id just using the data (that comes in bytes) but none of these options work.

I found the error, I should've been encoding the partition key and not the json id:
def send_message(data, name_topic, id):
client = get_kafka_client()
topic = client.topics[name_topic]
producer = topic.get_sync_producer()
producer.produce(data, partition_key=f"{name_topic[:2]}{id}".encode())
# Creating a Twitter stream listener:
class Listener(tweepy.StreamingClient):
def on_data(self, data):
print(data)
message = json.loads(data)
for rule in message['matching_rules']:
send_message(data, rule['tag'], message['data']['id'])
return True
def on_error(self, status):
print(status)

Sending external event to Durable Function Orchestrator causes Orchestrator to die

Im using a HTTP trigger to trigger an Orchestrator Function that runs multiple activity functions. The Http trigger is called every few minutes to retrieve the status of the Orchestration. The activity functions require an external token for certain operations. This token is send with the HTTP trigger to the Orchestrator. To avoid using an expired token when the Orchestrator runs longer, I included an external event into the Http trigger that sends a new token on each call of the trigger, since I didn't find another way to send new data to a running Orchestrator. Now if I add the wait_for_external_event function after my activity functions in the Orchestrator everything runs properly. But if I set it before the activities are called it causes the Orchestrator to stop working. The client.get_status function of the HTTP trigger returns a failed status after the first run.
I am not sure as to why this is, from my understanding it should not make a difference as to when I wait for the external event. Is there any reason why this is happening? In the monitoring the Orchestrator is still shown as "running".
This is my http trigger:
import logging
import azure.functions as func
import azure.durable_functions as df
import json
import uuid
from azure.durable_functions.models.OrchestrationRuntimeStatus import OrchestrationRuntimeStatus
async def main(req: func.HttpRequest, starter: str) -> func.HttpResponse:
try:
client = df.DurableOrchestrationClient(starter)
params = {}
for key, value in req.params.items():
params[key] = value
for key, value in _get_json(req).items():
params[key] = value
access_token = params["accessToken"]
azure_call_id = params.get("azureCallId")
if not azure_call_id:
azure_call_id = await client.start_new('TestOrchestrator',
instance_id=None,
client_input=params)
status = await client.get_status(azure_call_id)
if status.runtime_status == OrchestrationRuntimeStatus.Pending \
or status.runtime_status == OrchestrationRuntimeStatus.Running \
or status.runtime_status == OrchestrationRuntimeStatus.ContinuedAsNew:
await client.raise_event(azure_call_id, 'RefreshToken', {'accessToken': access_token})
return response
except Exception as e:
logging.error(f"Exception caught: {str(e)}")

handling async streaming request in grpc python

I am trying to understand how to handle a grpc api with bidirectional streaming (using the Python API).
Say I have the following simple server definition:
syntax = "proto3";
package simple;
service TestService {
rpc Translate(stream Msg) returns (stream Msg){}
}
message Msg
{
string msg = 1;
}
Say that the messages that will be sent from the client come asynchronously ( as a consequence of user selecting some ui elements).
The generated python stub for the client will contain a method Translate that will accept a generator function and will return an iterator.
What is not clear to me is how would I write the generator function that will return messages as they are created by the user. Sleeping on the thread while waiting for messages doesn't sound like the best solution.

This is a bit clunky right now, but you can accomplish your use case as follows:
#!/usr/bin/env python
from __future__ import print_function
import time
import random
import collections
import threading
from concurrent import futures
from concurrent.futures import ThreadPoolExecutor
import grpc
from translate_pb2 import Msg
from translate_pb2_grpc import TestServiceStub
from translate_pb2_grpc import TestServiceServicer
from translate_pb2_grpc import add_TestServiceServicer_to_server
def translate_next(msg):
return ''.join(reversed(msg))
class Translator(TestServiceServicer):
def Translate(self, request_iterator, context):
for req in request_iterator:
print("Translating message: {}".format(req.msg))
yield Msg(msg=translate_next(req.msg))
class TranslatorClient(object):
def __init__(self):
self._stop_event = threading.Event()
self._request_condition = threading.Condition()
self._response_condition = threading.Condition()
self._requests = collections.deque()
self._last_request = None
self._expected_responses = collections.deque()
self._responses = {}
def _next(self):
with self._request_condition:
while not self._requests and not self._stop_event.is_set():
self._request_condition.wait()
if len(self._requests) > 0:
return self._requests.popleft()
else:
raise StopIteration()
def next(self):
return self._next()
def __next__(self):
return self._next()
def add_response(self, response):
with self._response_condition:
request = self._expected_responses.popleft()
self._responses[request] = response
self._response_condition.notify_all()
def add_request(self, request):
with self._request_condition:
self._requests.append(request)
with self._response_condition:
self._expected_responses.append(request.msg)
self._request_condition.notify()
def close(self):
self._stop_event.set()
with self._request_condition:
self._request_condition.notify()
def translate(self, to_translate):
self.add_request(to_translate)
with self._response_condition:
while True:
self._response_condition.wait()
if to_translate.msg in self._responses:
return self._responses[to_translate.msg]
def _run_client(address, translator_client):
with grpc.insecure_channel('localhost:50054') as channel:
stub = TestServiceStub(channel)
responses = stub.Translate(translator_client)
for resp in responses:
translator_client.add_response(resp)
def main():
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
add_TestServiceServicer_to_server(Translator(), server)
server.add_insecure_port('[::]:50054')
server.start()
translator_client = TranslatorClient()
client_thread = threading.Thread(
target=_run_client, args=('localhost:50054', translator_client))
client_thread.start()
def _translate(to_translate):
return translator_client.translate(Msg(msg=to_translate)).msg
translator_pool = futures.ThreadPoolExecutor(max_workers=4)
to_translate = ("hello", "goodbye", "I", "don't", "know", "why",)
translations = translator_pool.map(_translate, to_translate)
print("Translations: {}".format(zip(to_translate, translations)))
translator_client.close()
client_thread.join()
server.stop(None)
if __name__ == "__main__":
main()
The basic idea is to have an object called TranslatorClient running on a separate thread, correlating requests and responses. It expects that responses will return in the order that requests were sent out. It also implements the iterator interface so that you can pass it directly to an invocation of the Translate method on your stub.
We spin up a thread running _run_client which pulls responses out of TranslatorClient and feeds them back in the other end with add_response.
The main function I included here is really just a strawman since I don't have the particulars of your UI code. I'm running _translate in a ThreadPoolExecutor to demonstrate that, even though translator_client.translate is synchronous, it yields, allowing you to have multiple in-flight requests at once.
We recognize that this is a lot of code to write for such a simple use case. Ultimately, the answer will be asyncio support. We have plans for this in the not-too-distant future. But for the moment, this sort of solution should keep you going whether you're running python 2 or python 3.

Avoid error 420s with streaming API tweepy

I made a python script which uses tweepy streaming module to stream mentions to a twitter account and carry some functions based on the status text.
I wanted it to stream until a mention is made, next stop streaming, carry some functions based on the status text and again start streaming.
This is my code:
class StdOutListener(tweepy.StreamListener):
def on_data(self, data):
tweet = json.loads(data.strip())
global d
d=tweet
return False #stops streaming after a tweet is fed to it
def on_error(self, status_code):
print(status_code)
time.sleep(120)
return F # To continue listening
def on_timeout(self)
time.sleep(120)
return True # To continue listening
while True:
d={}
listener = StdOutListener()
stream = tweepy.Stream(twitter_auth(tokens), listener)
stream.filter(track=['#xxx'])
stream.disconnect()
doSomething(d)
But it only works for one loop and later shows 420(Exceeding Rate Limit) errors,even though I just take in a single tweet (per stream, if I'm not wrong).
Can anyone please explain where I'm doing it wrong? And also when should we use async mode in tweepy stream listener?

Running Time Estimate for Stream Twitter with Location Filter in Tweepy

PROBLEM SOLVED, SEE SOLUTION AT THE END OF THE POST
I need help to estimate running time for my tweepy program calling Twitter Stream API with location filter.
After I kicked it off, it has run for over 20 minutes, which is longer than what I expected. I am new to Twitter Stream API, and have only worked with REST API for couple of days. It looks to me that REST API will give me 50 tweets in a few seconds, easy. But this Stream request is taking a lot more time. My program hasn't died on me or given any error. So I don't know if there's anything wrong with it. If so, please do point out.
In conclusion, if you think my code is correct, could you provide an estimate for the running time? If you think my code is wrong, could you help me to fix it?
Thank you in advance!
Here's the code:
# Import Tweepy, sys, sleep, credentials.py
import tweepy, sys
from time import sleep
from credentials import *
# Access and authorize our Twitter credentials from credentials.py
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
box = [-86.33,41.63,-86.20,41.74]
class CustomStreamListener(tweepy.StreamListener):
def on_error(self, status_code):
print >> sys.stderr, 'Encountered error with status code:', status_code
return True # Don't kill the stream
def on_timeout(self):
print >> sys.stderr, 'Timeout...'
return True # Don't kill the stream
stream = tweepy.streaming.Stream(auth, CustomStreamListener()).filter(locations=box).items(50)
stream
I tried the method from http://docs.tweepy.org/en/v3.4.0/auth_tutorial.html#auth-tutorial Apparently it is not working for me... Here is my code below. Would you mind giving any input? Let me know if you have some working code. Thanks!
# Import Tweepy, sys, sleep, credentials.py
import tweepy, sys
from time import sleep
from credentials import *
# Access and authorize our Twitter credentials from credentials.py
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
# Assign coordinates to the variable
box = [-74.0,40.73,-73.0,41.73]
import tweepy
#override tweepy.StreamListener to add logic to on_status
class MyStreamListener(tweepy.StreamListener):
def on_status(self, status):
print(status.text)
def on_error(self, status_code):
if status_code == 420:
#returning False in on_data disconnects the stream
return False
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener())
myStream.filter(track=['python'], locations=(box), async=True)
Here is the error message:
Traceback (most recent call last):
File "test.py", line 26, in <module>
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener())
TypeError: 'MyStreamListener' object is not callable
PROBLEM SOLVED! SEE SOLUTION BELOW
After another round of debug, here is the solution for one who may have interest in the same topic:
# Import Tweepy, sys, sleep, credentials.py
try:
import json
except ImportError:
import simplejson as json
import tweepy, sys
from time import sleep
from credentials import *
# Access and authorize our Twitter credentials from credentials.py
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
# Assign coordinates to the variable
box = [-74.0,40.73,-73.0,41.73]
import tweepy
#override tweepy.StreamListener to add logic to on_status
class MyStreamListener(tweepy.StreamListener):
def on_status(self, status):
print(status.text.encode('utf-8'))
def on_error(self, status_code):
if status_code == 420:
#returning False in on_data disconnects the stream
return False
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(api.auth, listener=myStreamListener)
myStream.filter(track=['NYC'], locations=(box), async=True)

Core Problem:
I think you're misunderstanding what the Stream is here.
Tl;dr: Your code is working, you're just not doing anything with the data that gets back.
The rest API call is a single call for information. You make a request, Twitter sends back some information, which gets assigned to your variable.
The StreamObject (which you've created as stream) from Tweepy opens a connection to twitter with your search parameters, and Twitter, well, streams Tweets to it. Forever.
From the Tweepy docs:
The streaming api is quite different from the REST api because the
REST api is used to pull data from twitter but the streaming api
pushes messages to a persistent session. This allows the streaming api
to download more data in real time than could be done using the REST
API.
So, you need to build a handler (streamListener, in tweepy's terminology), like this one that prints out the tweets..
Additional
Word of warning, from bitter experience - if you're going to try and save the tweets to a database: Twitter can, and will, stream objects to you much faster than you can save them to the database. This will result in your Stream being disconnected, because the tweets back up at Twitter, and over a certain level of backed-up-ness (not an actual phrase), they'll just disconnect you.
I handled this by using django-rq to put save jobs into a jobqueue - this way, I could handle hundreds of tweets a second (at peak), and it would smooth out. You can see how I did this below. Python-rq would also work if you're not using django as a framework round it. The read both method is just a function that reads from the tweet and saves it to a postgres database. In my specific case, I did that via the Django ORM, using the django_rq.enqueue function.
__author__ = 'iamwithnail'
from django.core.management.base import BaseCommand, CommandError
from django.db.utils import DataError
from harvester.tools import read_both
import django_rq
class Command(BaseCommand):
args = '<search_string search_string>'
help = "Opens a listener to the Twitter stream, and tracks the given string or list" \
"of strings, saving them down to the DB as they are received."
def handle(self, *args, **options):
try:
import urllib3.contrib.pyopenssl
urllib3.contrib.pyopenssl.inject_into_urllib3()
except ImportError:
pass
consumer_key = '***'
consumer_secret = '****'
access_token='****'
access_token_secret_var='****'
import tweepy
import json
# This is the listener, responsible for receiving data
class StdOutListener(tweepy.StreamListener):
def on_data(self, data):
decoded = json.loads(data)
try:
if decoded['lang'] == 'en':
django_rq.enqueue(read_both, decoded)
else:
pass
except KeyError,e:
print "Error on Key", e
except DataError, e:
print "DataError", e
return True
def on_error(self, status):
print status
l = StdOutListener()
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret_var)
stream = tweepy.Stream(auth, l)
stream.filter(track=args)
Edit: Your subsequent problem is caused by calling the listener wrongly.
myStreamListener = MyStreamListener() #creates an instance of your class
Where you have this:
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener())
You're trying to call the listener as a function when you use the (). So it should be:
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)
And in fact, can probably just be more succinctly written as:
myStream = tweepy.Stream(api.auth,myStreamListener)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Stream Tweets asynchronously with Tweepy AsyncStream - python

Related

How to automatically generate partition keys for messages (Kafka + Python)?

Sending external event to Durable Function Orchestrator causes Orchestrator to die

handling async streaming request in grpc python

Avoid error 420s with streaming API tweepy

Running Time Estimate for Stream Twitter with Location Filter in Tweepy

Categories

Resources