I'm trying to perform some more in-depth PII detection, as the standard code found here: https://learn.microsoft.com/en-us/azure/cognitive-services/language-service/personally-identifiable-information/quickstart?pivots=programming-language-python fails to find some more detailed entities (like French registration plate numbers, for example).
Everything works fine when I use the standard endpoint: 'https://whatever.cognitiveservices.azure.com/'
However, when I switch to 'https://whatever.cognitiveservices.azure.com/text/analytics/v3.1/entities/recognition/pii?piiCategories=default,FRDriversLicenseNumber' (an example found here: https://learn.microsoft.com/en-us/azure/cognitive-services/language-service/personally-identifiable-information/how-to-call ) I get a 404 error.
I believe it might be a Python SDK issue, as when I try the API console it works just fine: https://westus2.dev.cognitive.microsoft.com/docs/services/TextAnalytics-v3-1/operations/EntitiesRecognitionPii
The code:
key = "key"
endpoint = "https://whatever.cognitiveservices.azure.com/text/analytics/v3.1/entities/recognition/pii?piiCategories=default,FRDriversLicenseNumber/"
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential
# Authenticate the client using your key and endpoint
def authenticate_client():
ta_credential = AzureKeyCredential(key)
text_analytics_client = TextAnalyticsClient(
endpoint=endpoint,
credential=ta_credential)
return text_analytics_client
client = authenticate_client()
# Example method for detecting sensitive information (PII) from text
def pii_recognition_example(client):
documents = [
"The employee's SSN is 859-98-0987.",
"The employee's phone number is 555-555-5555."
]
response = client.recognize_pii_entities(documents, language="en")
result = [doc for doc in response if not doc.is_error]
for doc in result:
print("Redacted Text: {}".format(doc.redacted_text))
for entity in doc.entities:
print("Entity: {}".format(entity.text))
print("\tCategory: {}".format(entity.category))
print("\tConfidence Score: {}".format(entity.confidence_score))
print("\tOffset: {}".format(entity.offset))
print("\tLength: {}".format(entity.length))
pii_recognition_example(client)
As it is not stated in the MS docs yet: the endpoint should be kept simple:
endpoint = "https://<name>.cognitiveservices.azure.com"
and the PII category details passed to client.recognize_pii_entities() instead.
The below code works just fine:
key = "key"
endpoint = "https://<name>.cognitiveservices.azure.com"
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential
# Authenticate the client using your key and endpoint
def authenticate_client():
ta_credential = AzureKeyCredential(key)
text_analytics_client = TextAnalyticsClient(
endpoint=endpoint,
credential=ta_credential)
return text_analytics_client
client = authenticate_client()
# Example method for detecting sensitive information (PII) from text
def pii_recognition_example(client):
documents = [
"The employee's SSN is 859-98-0987.",
"The employee's phone number is 555-555-5555."
]
response = client.recognize_pii_entities(documents, language="en", categories_filter=["default", "FRDriversLicenseNumber"])
result = [doc for doc in response if not doc.is_error]
for doc in result:
print("Redacted Text: {}".format(doc.redacted_text))
for entity in doc.entities:
print("Entity: {}".format(entity.text))
print("\tCategory: {}".format(entity.category))
print("\tConfidence Score: {}".format(entity.confidence_score))
print("\tOffset: {}".format(entity.offset))
print("\tLength: {}".format(entity.length))
pii_recognition_example(client)
I have been working on this for hours and need some help. This mostly works: I am able to connect to Twitter, pull the JSON data, and store it in MongoDB. However, not all the data that I see in my 'print(tweet)' line shows up in MongoDB; specifically, I don't see the screen_name (or name, for that matter) field. I really just need these fields: "id", "text", "created_at", "screen_name", "retweet_count", "favourites_count", "lang", and I get them all but the name. I am not sure why it is not being inserted in the DB with all the other fields. Any help would be greatly appreciated!
from twython import Twython
from pymongo import MongoClient

ConsumerKey = "XXXXX"
ConsumerSecret = "XXXXX"
AccessToken = "XXXXX-XXXXX"
AccessTokenSecret = "XXXXX"

twitter = Twython(ConsumerKey,
                  ConsumerSecret,
                  AccessToken,
                  AccessTokenSecret)

result = twitter.search(q="drexel", count='100')
result1 = result['statuses']

for tweet in result1:
    print(tweet)  # prints tweets so I know I got data

client = MongoClient('mongodb://localhost:27017/')
db = client.twitterdb
tweet_collection = db.twitter_search

# Fields I need: ["id", "text", "created_at", "screen_name", "retweet_count", "favourites_count", "lang"]
for tweet in result1:
    try:
        tweet_collection.insert(tweet)
    except:
        pass

print("The number of tweets in English: ")
print(tweet_collection.count(lang="en"))
You can use the following approach:
def get_document(post):
    return {
        'id': post['id_str'],
        'text': post['text'],
        'created_at': post['created_at'],
        'retweet_count': post['retweet_count'],
        'favourites_count': post['user']['favourites_count'],
        'lang': post['lang'],
        'screen_name': post['user']['screen_name']
    }

for tweet in result1:
    try:
        tweet_collection.insert(
            get_document(tweet)
        )
    except:
        pass
It should work.
The "screen_name" field is a subset of the "user" part of the tweet metadata. Make sure you're drilling down far enough.
I am quite new to the Twitter API and Tweepy, and I am confused by the rate-limiting concept. I am using the streaming API and want to gather sample tweets without using any filters such as hashtags or location. Some sources state I should not get rate limited with sample tweets since I am only getting 1% of tweets, and some state otherwise. I keep getting error 420 very often, and I was wondering if there is a way to avoid it or make it smoother?
Thank you so much for your help
My code:
import json
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from textblob import TextBlob
from elasticsearch import Elasticsearch
from datetime import datetime

# import twitter keys and tokens
from config import *

# create instance of elasticsearch
es = Elasticsearch()

indexName = "test_new_fields"

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

class TweetStreamListener(StreamListener):
    hashtags = []

    # on success
    def on_data(self, data):
        # decode json
        dict_data = json.loads(data)  # data is a json string
        # print(data) # to print the twitter json string
        print(dict_data)

        # pass tweet into TextBlob
        tweet = TextBlob(dict_data["text"])

        # determine if sentiment is positive, negative, or neutral
        if tweet.sentiment.polarity < 0:
            sentiment = "negative"
        elif tweet.sentiment.polarity == 0:
            sentiment = "neutral"
        else:
            sentiment = "positive"

        # output polarity sentiment and tweet text
        print(str(tweet.sentiment.polarity) + " " + sentiment + " " + dict_data["text"])

        try:
            # check if there are any hashtags
            if len(dict_data["entities"]["hashtags"]) != 0:
                hashtags = dict_data["entities"]["hashtags"]
            # if no hashtags add empty
            else:
                hashtags = []
        except:
            pass

        es.indices.put_settings(index=indexName, body={"index.blocks.write": False})

        # add text and sentiment info to elasticsearch
        es.index(index=indexName,
                 doc_type="test-type",
                 body={"author": dict_data["user"]["screen_name"],
                       "date": dict_data["created_at"],  # unfortunately this gets stored as a string
                       "location": dict_data["user"]["location"],  # user location
                       "followers": dict_data["user"]["followers_count"],
                       "friends": dict_data["user"]["friends_count"],
                       "time_zone": dict_data["user"]["time_zone"],
                       "lang": dict_data["user"]["lang"],
                       # "timestamp": float(dict_data["timestamp_ms"]), # double not recognised as date
                       "timestamp": dict_data["timestamp_ms"],
                       "datetime": datetime.now(),
                       "message": dict_data["text"],
                       "hashtags": hashtags,
                       "polarity": tweet.sentiment.polarity,
                       "subjectivity": tweet.sentiment.subjectivity,
                       # handle geo data
                       # "coordinates": dict_data[coordinates],
                       "sentiment": sentiment})
        return True

    # on failure
    def on_error(self, error):
        print("error: " + str(error))

if __name__ == '__main__':
    # create instance of the tweepy tweet stream listener
    listener = TweetStreamListener()

    # set twitter keys/tokens
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    while True:
        try:
            # create instance of the tweepy stream
            stream = Stream(auth, listener)
            # search twitter for sample tweets
            stream.sample()
        except KeyError:
            pass
OK, I have found the solution to this problem: changing the method from on_data to on_status removed all the issues causing error 420.
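For reference, a minimal sketch of what that change might look like (tweepy 3.x style, reusing the listener class name from the question; what gets printed is just an example):

class TweetStreamListener(StreamListener):
    # on_status receives an already-parsed tweepy Status object instead of a raw JSON string
    def on_status(self, status):
        print(status.text)
        return True

    def on_error(self, status_code):
        if status_code == 420:
            # returning False disconnects the stream instead of hammering the endpoint
            return False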
This is my code in Python:
import tweepy
import csv
consumer_key = "?"
consumer_secret = "?"
access_token = "?"
access_token_secret = "?"
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
search_tweets = api.search('trump',count=1,tweet_mode='extended')
print(search_tweets[0].full_text)
print(search_tweets[0].id)
and the output is the following tweet
RT #CREWcrew: When Ivanka Trump has business interests across the world, we have to
ask if she's representing the United States or her busi…
967462205561212929
which is truncated, even though I used tweet_mode='extended'.
How can I extract the full text?
I've had the same problem as you recently; it happened for retweets only. I found that you can get the full text from tweet._json['retweeted_status']['full_text']
Code snippet:
...
search_tweets = api.search('trump',count=1,tweet_mode='extended')
for tweet in search_tweets:
    if 'retweeted_status' in tweet._json:
        print(tweet._json['retweeted_status']['full_text'])
    else:
        print(tweet.full_text)
...
EDIT: Also please note that this won't show the RT #.... prefix at the beginning of the text; you might want to add RT at the start of the text yourself, whatever suits you.
EDIT 2
You can get the name of the author of the tweet and add it at the beginning as follows:
retweet_text = 'RT # ' + api.get_user(tweet.retweeted_status.user.id_str).screen_name
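Note that the author is also available directly on the embedded user object, which avoids the extra api.get_user() call (a sketch, assuming a tweepy Status object for a retweet as in the snippet above):

retweet_text = 'RT # ' + tweet.retweeted_status.user.screen_name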
This worked for me:
if 'retweeted_status' in status._json:
    if 'extended_tweet' in status._json['retweeted_status']:
        text = 'RT #' + status._json['retweeted_status']['user']['screen_name'] + ':' + status._json['retweeted_status']['extended_tweet']['full_text']
    else:
        text = 'RT #' + status._json['retweeted_status']['user']['screen_name'] + ':' + status._json['retweeted_status']['text']
else:
    if 'extended_tweet' in status._json:
        text = status._json['extended_tweet']['full_text']
    else:
        text = status.text
I'm just wondering if there is any way to write a Python script to check whether a twitch.tv stream is live?
I'm not sure why my app engine tag was removed, but this would be using app engine.
Since all the answers are outdated as of 2020-05-02, I'll give it a shot. You are now required to register a developer application (I believe), and you must use an endpoint that takes a user ID instead of a username (as usernames can change).
See https://dev.twitch.tv/docs/v5/reference/users
and https://dev.twitch.tv/docs/v5/reference/streams
First you'll need to register an application.
From that you'll need to get your Client-ID.
The one in this example is not a real one.
import requests

TWITCH_STREAM_API_ENDPOINT_V5 = "https://api.twitch.tv/kraken/streams/{}"

API_HEADERS = {
    'Client-ID': 'tqanfnani3tygk9a9esl8conhnaz6wj',
    'Accept': 'application/vnd.twitchtv.v5+json',
}

reqSession = requests.Session()

def checkUser(userID):  # returns true if online, false if not
    url = TWITCH_STREAM_API_ENDPOINT_V5.format(userID)
    try:
        req = reqSession.get(url, headers=API_HEADERS)
        jsondata = req.json()
        if 'stream' in jsondata:
            if jsondata['stream'] is not None:  # stream is online
                return True
            else:
                return False
    except Exception as e:
        print("Error checking user: ", e)
        return False
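If you only have a login name, a rough sketch of resolving it to the user ID that endpoint expects, via the v5 users lookup linked above (field names assumed from those docs), could look like this:

def getUserID(login):  # hypothetical helper, not part of the snippet above
    url = "https://api.twitch.tv/kraken/users?login={}".format(login)
    req = reqSession.get(url, headers=API_HEADERS)
    users = req.json().get('users', [])
    # the lookup returns a list of matching users; '_id' is the numeric user ID
    return users[0]['_id'] if users else None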
I hated having to go through the process of making an API key and all those things just to check if a channel was live, so I tried to find a workaround:
As of June 2021, if you send an HTTP GET request to a URL like https://www.twitch.tv/CHANNEL_NAME, the response will contain "isLiveBroadcast": true if the stream is live; if the stream is not live, there will be nothing like that.
So I wrote this code as an example in Node.js:
const fetch = require('node-fetch');

const channelName = '39daph';

async function main(){
    let a = await fetch(`https://www.twitch.tv/${channelName}`);
    if( (await a.text()).includes('isLiveBroadcast') )
        console.log(`${channelName} is live`);
    else
        console.log(`${channelName} is not live`);
}

main();
Here is also an example in Python:
import requests

channelName = '39daph'

contents = requests.get('https://www.twitch.tv/' + channelName).content.decode('utf-8')

if 'isLiveBroadcast' in contents:
    print(channelName + ' is live')
else:
    print(channelName + ' is not live')
It looks like Twitch provides an API (documentation here) that gives you a way to get that info. A very simple example of getting the feed would be:
import urllib2
url = 'http://api.justin.tv/api/stream/list.json?channel=FollowGrubby'
contents = urllib2.urlopen(url)
print contents.read()
This will dump all of the info, which you can then parse with a JSON library (XML looks to be available too). Looks like the value returns empty if the stream isn't live (haven't tested this much at all, nor have I read anything :) ). Hope this helps!
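A rough sketch of that parsing step, assuming the empty-result-when-offline behaviour described above:

import urllib2, json

url = 'http://api.justin.tv/api/stream/list.json?channel=FollowGrubby'
data = json.loads(urllib2.urlopen(url).read())

# an empty result appears to mean the channel is not live
print('live' if data else 'offline')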
RocketDonkey's fine answer seems to be outdated by now, so I'm posting an updated answer for people like me who stumble across this SO question via Google.
You can check the status of the user EXAMPLEUSER by parsing
https://api.twitch.tv/kraken/streams/EXAMPLEUSER
The entry "stream":null will tell you that the user if offline, if that user exists.
Here is a small Python script which you can use on the commandline that will print 0 for user online, 1 for user offline and 2 for user not found.
#!/usr/bin/env python3
# checks whether a twitch.tv userstream is live
import argparse
from urllib.request import urlopen
from urllib.error import URLError
import json

def parse_args():
    """ parses commandline, returns args namespace object """
    desc = ('Check online status of twitch.tv user.\n'
            'Exit prints are 0: online, 1: offline, 2: not found, 3: error.')
    parser = argparse.ArgumentParser(description=desc,
                                     formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('USER', nargs=1, help='twitch.tv username')
    args = parser.parse_args()
    return args

def check_user(user):
    """ returns 0: online, 1: offline, 2: not found, 3: error """
    url = 'https://api.twitch.tv/kraken/streams/' + user
    try:
        info = json.loads(urlopen(url, timeout=15).read().decode('utf-8'))
        if info['stream'] == None:
            status = 1
        else:
            status = 0
    except URLError as e:
        if e.reason == 'Not Found' or e.reason == 'Unprocessable Entity':
            status = 2
        else:
            status = 3
    return status

# main
try:
    user = parse_args().USER[0]
    print(check_user(user))
except KeyboardInterrupt:
    pass
Here is a more up-to-date answer using the latest version of the Twitch API (helix). (kraken is deprecated, and you shouldn't use GQL since it's not documented for third-party use.)
It works, but you should store and reuse the token rather than generate a new one every time you run the script (see the token-caching sketch after the code below).
import requests

client_id = ''
client_secret = ''
streamer_name = ''

body = {
    'client_id': client_id,
    'client_secret': client_secret,
    'grant_type': 'client_credentials'
}

r = requests.post('https://id.twitch.tv/oauth2/token', body)

# data output
keys = r.json()
print(keys)

headers = {
    'Client-ID': client_id,
    'Authorization': 'Bearer ' + keys['access_token']
}
print(headers)

stream = requests.get('https://api.twitch.tv/helix/streams?user_login=' + streamer_name, headers=headers)
stream_data = stream.json()
print(stream_data)

if len(stream_data['data']) == 1:
    print(streamer_name + ' is live: ' + stream_data['data'][0]['title'] + ' playing ' + stream_data['data'][0]['game_name'])
else:
    print(streamer_name + ' is not live')
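A minimal sketch of the token reuse suggested above (the cache file name is hypothetical; expires_in comes from the token response and is given in seconds):

import json, os, time

TOKEN_FILE = 'twitch_token.json'  # hypothetical cache file name

def get_token():
    # reuse the previously fetched token if it has not expired yet
    if os.path.exists(TOKEN_FILE):
        with open(TOKEN_FILE) as f:
            cached = json.load(f)
        if cached['expires_at'] > time.time():
            return cached['access_token']
    r = requests.post('https://id.twitch.tv/oauth2/token', body)
    keys = r.json()
    with open(TOKEN_FILE, 'w') as f:
        json.dump({'access_token': keys['access_token'],
                   'expires_at': time.time() + keys['expires_in']}, f)
    return keys['access_token']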
Explanation
Now, the Twitch API v5 is deprecated. The helix API is in place, where an OAuth Authorization Bearer AND client-id is needed. This is pretty annoying, so I went on a search for a viable workaround, and found one.
GraphQL
When inspecting Twitch's network requests, while not being logged in, I found out the anonymous API relies on GraphQL. GraphQL is a query language for APIs.
query {
    user(login: "USERNAME") {
        stream {
            id
        }
    }
}
In the graphql query above, we are querying a user by their login name. If they are streaming, the stream's id will be given. If not, None will be returned.
The Final Code
The finished Python code, wrapped in a function, is below. The client-id is taken from Twitch's website; Twitch uses it to fetch information for anonymous users, so it works without the need to get your own client-id.
import requests

# ...

def checkIfUserIsStreaming(username):
    url = "https://gql.twitch.tv/gql"
    query = "query {\n user(login: \"" + username + "\") {\n stream {\n id\n }\n }\n}"
    return True if requests.request("POST", url, json={"query": query, "variables": {}}, headers={"client-id": "kimne78kx3ncx6brgo4mv6wki5h1ko"}).json()["data"]["user"]["stream"] else False
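A quick usage example (the channel name is just a placeholder):

print(checkIfUserIsStreaming("39daph"))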
I've created a website where you can play with Twitch's GraphQL API. Refer to the GraphQL Docs for help on GraphQL syntax! There's also Twitch GraphQL API documentation on my playground.
Use the Twitch API with your client_id as a parameter, then parse the JSON:
https://api.twitch.tv/kraken/streams/massansc?client_id=XXXXXXX
Twitch Client Id is explained here: https://dev.twitch.tv/docs#client-id,
you need to register a developer application: https://www.twitch.tv/kraken/oauth2/clients/new
Example:
import requests
import json

def is_live_stream(streamer_name, client_id):
    twitch_api_stream_url = "https://api.twitch.tv/kraken/streams/" \
        + streamer_name + "?client_id=" + client_id

    streamer_html = requests.get(twitch_api_stream_url)
    streamer = json.loads(streamer_html.content)

    return streamer["stream"] is not None
I'll take a shot at this, just in case someone still needs an answer; here it goes.
import requests
import time
from twitchAPI.twitch import Twitch

client_id = ""
client_secret = ""

twitch = Twitch(client_id, client_secret)
twitch.authenticate_app([])

TWITCH_STREAM_API_ENDPOINT_V5 = "https://api.twitch.tv/kraken/streams/{}"

API_HEADERS = {
    'Client-ID': client_id,
    'Accept': 'application/vnd.twitchtv.v5+json',
}

def checkUser(user):  # returns true if online, false if not
    userid = twitch.get_users(logins=[user])['data'][0]['id']
    url = TWITCH_STREAM_API_ENDPOINT_V5.format(userid)
    try:
        req = requests.Session().get(url, headers=API_HEADERS)
        jsondata = req.json()
        if 'stream' in jsondata:
            if jsondata['stream'] is not None:
                return True
            else:
                return False
    except Exception as e:
        print("Error checking user: ", e)
        return False

print(checkUser('michaelreeves'))
https://dev.twitch.tv/docs/api/reference#get-streams
import requests

# ================================================================
# your twitch client id
client_id = ''
# your twitch secret
client_secret = ''
# twitch username you want to check if it is streaming online
twitch_user = ''

# ================================================================
# getting auth token
url = 'https://id.twitch.tv/oauth2/token'
params = {
    'client_id': client_id,
    'client_secret': client_secret,
    'grant_type': 'client_credentials'}
req = requests.post(url=url, params=params)
token = req.json()['access_token']
print(f'{token=}')

# ================================================================
# getting user data (user id for example)
url = f'https://api.twitch.tv/helix/users?login={twitch_user}'
headers = {
    'Authorization': f'Bearer {token}',
    'Client-Id': f'{client_id}'}
req = requests.get(url=url, headers=headers)
userdata = req.json()
userid = userdata['data'][0]['id']
print(f'{userid=}')

# ================================================================
# getting stream info (by user id for example)
url = f'https://api.twitch.tv/helix/streams?user_id={userid}'
headers = {
    'Authorization': f'Bearer {token}',
    'Client-Id': f'{client_id}'}
req = requests.get(url=url, headers=headers)
streaminfo = req.json()
print(f'{streaminfo=}')

# ================================================================
This solution doesn't require registering an application
import requests

HEADERS = {'client-id': 'kimne78kx3ncx6brgo4mv6wki5h1ko'}
GQL_QUERY = """
query($login: String) {
    user(login: $login) {
        stream {
            id
        }
    }
}
"""

def isLive(username):
    QUERY = {
        'query': GQL_QUERY,
        'variables': {
            'login': username
        }
    }
    response = requests.post('https://gql.twitch.tv/gql',
                             json=QUERY, headers=HEADERS)
    dict_response = response.json()
    return True if dict_response['data']['user']['stream'] is not None else False

if __name__ == '__main__':
    USERS = ['forsen', 'offineandy', 'dyrus']
    for user in USERS:
        IS_LIVE = isLive(user)
        print(f'User {user} live: {IS_LIVE}')
Yes.
You can use the Twitch API call https://api.twitch.tv/kraken/streams/YOUR_CHANNEL_NAME and parse the result to check whether the channel is live.
The function below returns a stream ID if the channel is live, else returns -1.
import urllib2, json, sys

TwitchChannel = 'A_Channel_Name'

def IsTwitchLive():  # returns the stream Id if streaming, else returns -1
    url = str('https://api.twitch.tv/kraken/streams/' + TwitchChannel)
    streamID = -1
    response = urllib2.urlopen(url)
    html = response.read()
    data = json.loads(html)
    try:
        streamID = data['stream']['_id']
    except:
        streamID = -1
    return int(streamID)