How to get full text in Twitter search API? - python

I have used the Twitter search API with tweet_mode='extended' so that the full_text attribute is returned, but I am still getting a truncated string from the API.
Here is my code:
results = t.search(q='tuberculosis', count=50, lang='en', result_type='popular', tweet_mode='extended')
all_tweets = results['statuses']
for tweet in all_tweets:
    tweetString = tweet["full_text"]
    userMentionList = tweet["entities"]["user_mentions"]
    if len(userMentionList) > 0:
        for eachUserMention in userMentionList:
            name = eachUserMention["screen_name"]
            time = tweet["created_at"]
            wks.insert_rows(wks.rows, values=[tweetString, name, time], inherit=True)

If you are using TwitterSearch, the following should work:
from TwitterSearch import TwitterSearch, TwitterSearchOrder

# create the search client with your API credentials (placeholders shown)
ts = TwitterSearch(consumer_key='...', consumer_secret='...', access_token='...', access_token_secret='...')
tso = TwitterSearchOrder()
tso.set_keywords(['tuberculosis'])
for tweet in ts.search_tweets_iterable(tso):
    print(tweet['text'])
You can of course also set your desired attributes such as language and count.
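For example, a minimal sketch of setting those options (the set_language and set_count method names are from the TwitterSearch library's TwitterSearchOrder and may differ between versions):
tso = TwitterSearchOrder()
tso.set_keywords(['tuberculosis'])
tso.set_language('en')  # restrict results to English
tso.set_count(50)       # tweets per page returned by the API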

Related

extract dictionary values from non-subscriptable object-type in python

I'm a python novice, trying to learn and be useful at work at the same time.
We use DespatchBay to send parcels. They have a SOAP API which I don't entirely understand, so I am using an SDK they released.
Before booking a collection and producing a label, my code queries the API to get available services and available dates, and returns what I think are custom object types containing the info I need. I want to extract and then print information from these objects so that I can confirm the correct details have been used.
postcode = "NW1 4RY"
street_num = 1
recipient_address = client.find_address(postcode, street_num)
print (recipient_address)
yields:
(AddressType){
CompanyName = "London Zoo"
Street = "Regents Park"
Locality = None
TownCity = "London"
County = None
PostalCode = "NW1 4RY"
CountryCode = "GB"
}
I can see there's a dictionary there, and I want to drill down into it to extract details, but I don't understand the "(AddressType)" before the dictionary - how do I get past it and call values from the dictionary?
AddressType has some ref here but it doesn't shed much light for me.
Thanks for any help you can offer!
Full code: sdk ref
import os
from despatchbay.despatchbay_sdk import DespatchBaySDK
from pprint import pprint
api_user = os.getenv('DESPATCH_API_USER')
api_key = os.getenv('DESPATCH_API_KEY')
client = DespatchBaySDK(api_user=api_user, api_key=api_key)
sender_id = '5536'
# inputs
postcode = "NW1 4RY"
street_num = 1
customer = "Testy Mctestson"
phone = "07666666666"
email = "testicles#tested.com"
num_boxes = 2
collection_date = '2022-09-11'
recipient_address = client.find_address(postcode, street_num)
recipient = client.recipient(
    name=customer,
    telephone=phone,
    email=email,
    recipient_address=recipient_address
)
print(recipient_address)
parcels = []
parcel_names = []
for x in range(num_boxes):
    parcelname = "my_parcel_" + str(x + 1)
    parcel_names.append(parcelname)
for my_parcel in parcel_names:
    go_parcel = client.parcel(
        contents="Radios",
        value=500,
        weight=6,
        length=60,
        width=40,
        height=40,
    )
    parcels.append(go_parcel)
sender = client.sender(
    address_id=sender_id
)
shipment_request = client.shipment_request(
    parcels=parcels,
    client_reference=customer,
    collection_date=collection_date,
    sender_address=sender,
    recipient_address=recipient,
    follow_shipment='true'
)
services = client.get_available_services(shipment_request)
shipment_request.service_id = services[0].service_id
dates = client.get_available_collection_dates(sender, services[0].courier.courier_id)
print(customer + "'s shipment of",num_boxes, "parcels will be collected from: ",recipient['RecipientAddress'], "on", dates[0])
shipment_request.collection_date = dates[0]
added_shipment = client.add_shipment(shipment_request)
client.book_shipments([added_shipment])
shipment_return = client.get_shipment(added_shipment)
label_pdf = client.get_labels(shipment_return.shipment_document_id)
label_pdf.download('./' + customer + '.pdf')
You do not know exactly how they store the data inside their object, even if it looks like a dictionary. You could run print(dir(recipient_address)) and see what the "inside of the object" looks like. Once you get the output, you can start inspecting which attributes or methods may hold the data you want. Of course, you should always follow the published contract for interacting with these objects, as the implementation details can always change. I've examined the source code of this object, published here:
https://github.com/despatchbay/despatchbay-python-sdk/blob/master/despatchbay/despatchbay_entities.py#L169
It turns out that it doesn't use a dictionary. It looks like you are meant to access the data via attributes, as follows:
recipient_address.company_name
recipient_address.street
recipient_address.locality
recipient_address.town_city
recipient_address.county
recipient_address.postal_code
recipient_address.country_code
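For example, to confirm the correct details were used you could simply print a few of these attributes (the values shown are the sample address from the question):
print(recipient_address.company_name)  # "London Zoo"
print(recipient_address.town_city)     # "London"
print(recipient_address.postal_code)   # "NW1 4RY"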
I agree, their python SDK could use improved documentation.

How to scrape a link from a multipart email in python

I have a program which logs on to a specified Gmail account and gets all the emails in a selected inbox that were sent from an email address that you input at runtime.
I would like to be able to grab all the links from each email and append them to a list so that I can then filter out the ones I don't need before outputting them to another file. I was using a regex to do this, which requires me to convert the payload to a string. The problem is that the regex I am using doesn't work with findall(); it only works when I use search() (I am not too familiar with regexes). I was wondering if there was a better way to extract all links from an email that doesn't involve me messing around with regexes?
My code currently looks like this:
print(f'[{Mail.timestamp}] Scanning inbox')
sys.stdout.write(Style.RESET)
self.search_mail_status, self.amount_matching_criteria = self.login_session.search(Mail.CHARSET, search_criteria)
if self.amount_matching_criteria == 0 or self.amount_matching_criteria == '0':
    print(f'[{Mail.timestamp}] No mails from that email address could be found...')
    Mail.enter_to_continue()
    import main
    main.main_wrapper()
else:
    pattern = r'(?P<url>https?://[^\s]+)'
    prog = re.compile(pattern)
    self.amount_matching_criteria = self.amount_matching_criteria[0]
    self.amount_matching_criteria_str = str(self.amount_matching_criteria)
    num_mails = re.search(r"\d.+", self.amount_matching_criteria_str)
    num_mails = ((num_mails.group())[:-1]).split(' ')
    sys.stdout.write(Style.GREEN)
    print(f'[{Mail.timestamp}] Status code of {self.search_mail_status}')
    sys.stdout.write(Style.RESET)
    sys.stdout.write(Style.YELLOW)
    print(f'[{Mail.timestamp}] Found {len(num_mails)} emails')
    sys.stdout.write(Style.RESET)
    num_mails = self.amount_matching_criteria.split()
    for message_num in num_mails:
        individual_response_code, individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
        message = email.message_from_bytes(individual_response_data[0][1])
        if message.is_multipart():
            print('multipart')
            multipart_payload = message.get_payload()
            for sub_message in multipart_payload:
                string_payload = str(sub_message.get_payload())
                print(prog.search(string_payload).group("url"))
I ended up using this for loop with a recursive function and a regex to get the links. I then removed all links that didn't contain the substring that you can input earlier on in the program, before appending them to a set.
for message_num in self.amount_matching_criteria.split():
    counter += 1
    _, self.individual_response_data = self.login_session.fetch(message_num, '(RFC822)')
    self.raw = email.message_from_bytes(self.individual_response_data[0][1])
    raw = self.raw
    self.scraped_email_value = email.message_from_bytes(Mail.scrape_email(raw))
    self.scraped_email_value = str(self.scraped_email_value)
    self.returned_links = prog.findall(self.scraped_email_value)
    for i in self.returned_links:
        if self.substring_filter in i:
            self.link_set.add(i)
    self.timestamp = time.strftime('%H:%M:%S')
    print(f'[{self.timestamp}] Links scraped: [{counter}/{len(num_mails)}]')
The function used:
def scrape_email(raw):
    if raw.is_multipart():
        return Mail.scrape_email(raw.get_payload(0))
    else:
        return raw.get_payload(None, True)
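As an aside, here is a sketch using only the standard library (not the poster's Mail class): email.message.Message.walk() traverses every MIME part, so the same links can be collected without explicit recursion:
import re

URL_PATTERN = re.compile(r'(?P<url>https?://\S+)')

def extract_links(message):
    # message is an email.message.Message; walk() yields the message and every sub-part
    links = []
    for part in message.walk():
        if part.get_content_type() in ('text/plain', 'text/html'):
            payload = part.get_payload(decode=True)  # bytes, or None for container parts
            if payload:
                links.extend(URL_PATTERN.findall(payload.decode(errors='ignore')))
    return links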

Most efficient way to Twitter Stream?

My partner and I started learning Python at the beginning of the year. I am at the point where a) my partner and I are almost finished with our code, but b) are pulling our hair out trying to get it to work.
Assignment: Pull 250 tweets based on a certain topic, geocode location of tweets, analyze based on sentiment, then display them on a web-map. We have accomplished almost all of that except the 250 tweets requirement.
And I do not know how to pull the tweets more efficiently. The code works, but it writes only around seven to twelve rows of information to the CSV before it times out.
I tried setting a tracking parameter, but received this error: TypeError: 'NoneType' object is not subscriptable
I tried expanding the locations parameter to stream.filter(locations=[-180,-90,180,90]), but ran into a similar problem: TypeError: 'NoneType' object has no attribute 'latitude'
I really do not know what I am missing and I was wondering if anyone has any ideas.
CODE BELOW:
from geopy import geocoders
from geopy.exc import GeocoderTimedOut
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from textblob import TextBlob
import json
import csv

def geo(location):
    g = geocoders.Nominatim(user_agent='USER')
    if location is not None:
        loc = g.geocode(location, timeout=None)
        if loc.latitude and loc.longitude is not None:
            return loc.latitude, loc.longitude

def WriteCSV(user, text, sentiment, lat, long):
    f = open('D:/PATHWAY/TO/tweets.csv', 'a', encoding="utf-8")
    write = csv.writer(f)
    write.writerow([user, text, sentiment, lat, long])
    f.close()

CK = ''
CS = ''
AK = ''
AS = ''
auth = tweepy.OAuthHandler(CK, CS)
auth.set_access_token(AK, AS)
# By setting these values to true, our code will automatically wait as it hits its limits
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
# Now I'm going to set up a stream listener
# https://stackoverflow.com/questions/20863486/tweepy-streaming-stop-collecting-tweets-at-x-amount
# https://wafawaheedas.gitbooks.io/twitter-sentiment-analysis-visualization-tutorial/sentiment-analysis-using-textblob.html
class StdOutListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super(StdOutListener, self).__init__()
        self.num_tweets = 0
    def on_data(self, data):
        Data = json.loads(data)
        Author = Data['user']['screen_name']
        Text = Data['text']
        Tweet = TextBlob(Data["text"])
        Sentiment = Tweet.sentiment.polarity
        x, y = geo(Data['place']['full_name'])
        if "coronavirus" in Text:
            WriteCSV(Author, Text, Sentiment, x, y)
        self.num_tweets += 1
        if self.num_tweets < 50:
            return True
        else:
            return False

stream = tweepy.Stream(auth=api.auth, listener=StdOutListener())
stream.filter(locations=[-122.441, 47.255, -122.329, 47.603])
The Twitter and Geolocation API returns all kinds of data. Some of the fields may be missing.
TypeError: 'NoneType' object has no attribute 'latitude'
This error comes from here:
loc = g.geocode(location, timeout=None)
if loc.latitude and loc.longitude is not None:
    return loc.latitude, loc.longitude
You provide a location and it searches for that location, but if it cannot find it, it writes None into loc.
Consequently loc.latitude won't work because loc is None.
You should check loc first before accessing any of its attributes.
x,y = geo(Data['place']['full_name'])
I know you are filtering tweets by location, so your Twitter Status object should have Data['place']['full_name']. But this is not always the case. You should check if the key really does exist before accessing the values.
This applies generally and should be applied to your whole code. Write robust code. You will have an easier time debugging mistakes if you implement some try/except blocks and print out the objects to see how they are built. Maybe set a breakpoint in your except block and do some live inspection.
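A minimal sketch of those checks, reusing names from the question's code (this only illustrates the defensive pattern, not a complete listener):
def geo(location):
    g = geocoders.Nominatim(user_agent='USER')
    if location is None:
        return None
    loc = g.geocode(location, timeout=None)
    if loc is None:  # the geocoder could not resolve the location
        return None
    return loc.latitude, loc.longitude

# inside on_data():
place = Data.get('place')  # 'place' can be missing or null in a Status
coords = geo(place['full_name']) if place else None
if coords is not None and "coronavirus" in Text:
    x, y = coords
    WriteCSV(Author, Text, Sentiment, x, y)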

Spotipy: How to read more than 100 tracks from a playlist

I'm trying to pull all tracks in a certain playlist using the Spotipy library for python.
The user_playlist_tracks function is limited to 100 tracks, regardless of the parameter limit. The Spotipy documentation describes it as:
user_playlist_tracks(user, playlist_id=None, fields=None, limit=100, offset=0, market=None)
Get full details of the tracks of a playlist owned by a user.
Parameters:
- user - the id of the user
- playlist_id - the id of the playlist
- fields - which fields to return
- limit - the maximum number of tracks to return
- offset - the index of the first track to return
- market - an ISO 3166-1 alpha-2 country code.
After authenticating with Spotify, I'm currently using something like this:
username = xxxx
playlist = #fromspotipy
sp_playlist = sp.user_playlist_tracks(username, playlist_id=playlist)
tracks = sp_playlist['items']
print tracks
Is there a way to return more than 100 tracks? I've tried setting the limit=None in the function parameters, but it returns an error.
Many of the spotipy methods return paginated results, so you will have to scroll through them to view more than just the max limit. I've encountered this most often when collecting a playlist's full track listing and consequently created a custom method to handle this:
def get_playlist_tracks(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks
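Usage is then just (using the username and playlist variables from the question):
all_tracks = get_playlist_tracks(username, playlist)
print(len(all_tracks))  # can now be greater than 100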
I wrote a function that outputs a pandas DataFrame, pulling the metadata (not all of it, because I didn't want to, but you can make some space for that) for playlists over 100 songs. I do it by iterating over every song, finding the metadata for each, saving the metadata to a dictionary, and then concatenating the dictionary to the DataFrame. It takes your username and the playlist ID as input.
# Function to extract MetaData from a playlist thats longer than 100 songs
def get_playlist_tracks_more_than_100_songs(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    results = tracks
    playlist_tracks_id = []
    playlist_tracks_titles = []
    playlist_tracks_artists = []
    playlist_tracks_first_artists = []
    playlist_tracks_first_release_date = []
    playlist_tracks_popularity = []
    for i in range(len(results)):
        print(i)  # Counter
        if i == 0:
            playlist_tracks_id = results[i]['track']['id']
            playlist_tracks_titles = results[i]['track']['name']
            playlist_tracks_first_release_date = results[i]['track']['album']['release_date']
            playlist_tracks_popularity = results[i]['track']['popularity']
            artist_list = []
            for artist in results[i]['track']['artists']:
                artist_list = artist['name']
            playlist_tracks_artists = artist_list
            features = sp.audio_features(playlist_tracks_id)
            features_df = pd.DataFrame(data=features, columns=features[0].keys())
            features_df['title'] = playlist_tracks_titles
            features_df['all_artists'] = playlist_tracks_artists
            features_df['popularity'] = playlist_tracks_popularity
            features_df['release_date'] = playlist_tracks_first_release_date
            features_df = features_df[['id', 'title', 'all_artists', 'popularity', 'release_date',
                                       'danceability', 'energy', 'key', 'loudness',
                                       'mode', 'acousticness', 'instrumentalness',
                                       'liveness', 'valence', 'tempo',
                                       'duration_ms', 'time_signature']]
            continue
        else:
            try:
                playlist_tracks_id = results[i]['track']['id']
                playlist_tracks_titles = results[i]['track']['name']
                playlist_tracks_first_release_date = results[i]['track']['album']['release_date']
                playlist_tracks_popularity = results[i]['track']['popularity']
                artist_list = []
                for artist in results[i]['track']['artists']:
                    artist_list = artist['name']
                playlist_tracks_artists = artist_list
                features = sp.audio_features(playlist_tracks_id)
                new_row = {'id': [playlist_tracks_id],
                           'title': [playlist_tracks_titles],
                           'all_artists': [playlist_tracks_artists],
                           'popularity': [playlist_tracks_popularity],
                           'release_date': [playlist_tracks_first_release_date],
                           'danceability': [features[0]['danceability']],
                           'energy': [features[0]['energy']],
                           'key': [features[0]['key']],
                           'loudness': [features[0]['loudness']],
                           'mode': [features[0]['mode']],
                           'acousticness': [features[0]['acousticness']],
                           'instrumentalness': [features[0]['instrumentalness']],
                           'liveness': [features[0]['liveness']],
                           'valence': [features[0]['valence']],
                           'tempo': [features[0]['tempo']],
                           'duration_ms': [features[0]['duration_ms']],
                           'time_signature': [features[0]['time_signature']]
                           }
                dfs = [features_df, pd.DataFrame(new_row)]
                features_df = pd.concat(dfs, ignore_index=True)
            except:
                continue
    return features_df
Another way around it would be to write a for loop and do:
offset += 100
then you could concatenate the tracks at the end, or put them in a data frame.
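A rough sketch of that offset approach (assuming sp is an authenticated spotipy client, as above):
offset = 0
all_items = []
while True:
    page = sp.user_playlist_tracks(username, playlist_id, limit=100, offset=offset)
    all_items.extend(page['items'])
    if page['next'] is None:  # no further pages
        break
    offset += 100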
Function Ref:
playlist_tracks(playlist_id, fields=None, limit=100, offset=0, market=None)
Reference: https://spotipy.readthedocs.io/en/2.7.0/#spotipy.client.Spotify.playlist_tracks
Below is the user_playlist_tracks method used in spotipy (notice it defaults to a limit of 100).
Try setting the limit to 200.
def user_playlist_tracks(self, user, playlist_id=None, fields=None,
                         limit=100, offset=0):
    ''' Get full details of the tracks of a playlist owned by a user.
        Parameters:
            - user - the id of the user
            - playlist_id - the id of the playlist
            - fields - which fields to return
            - limit - the maximum number of tracks to return
            - offset - the index of the first track to return
    '''
    plid = self._get_id('playlist', playlist_id)
    return self._get("users/%s/playlists/%s/tracks" % (user, plid),
                     limit=limit, offset=offset, fields=fields)
When trying the above solutions I got KeyError messages, but I eventually figured it out. Here is my solution; it only displays the tracks/artists from the pages after the first one.
id = "5lrkIjzukk65X4ksulpA0H?si=9db60a70278a4fd6"
results = sp.playlist_items(id)
tracks = results['tracks']
next_pages = 14
track_list = []
for i in range(next_pages):
tracks = sp.next(tracks)
for y in range(0,100):
try:
track = tracks['items'][y]['track']['name']
artist = tracks['items'][y]['track']['artists'][0]['name']
track_list.append(artist)
except:
continue
print(track_list)
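If you would rather not hard-code next_pages, the paging object also reports the total number of tracks, so the remaining page count can be computed (a small sketch, using the tracks paging object fetched before the loop):
import math

total = tracks['total']                  # total number of tracks in the playlist
next_pages = math.ceil(total / 100) - 1  # pages remaining after the first one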
It's unfortunate spotipy makes its API access so complicated. Try using spotifyr in R, and you can accomplish this in a few lines of code. No loops, lists, extra variables, or appending required. Then just pop it back into Python if you'd like.
library(spotifyr)
df <- get_playlist_audio_features('playlist_owner_username', 'playlist_uri')
And boom, you're done. I'm not sure what the max is, but I know it's over 300 songs because I have pulled that in.

return actual tweets in tweepy?

I was writing a twitter program using tweepy. When I run this code, it prints the Python ... values for them, like
<tweepy.models.Status object at 0x95ff8cc>
Which is not good. How do I get the actual tweet?
import tweepy, tweepy.api
key = XXXXX
sec = XXXXX
tok  = XXXXX
tsec = XXXXX
auth = tweepy.OAuthHandler(key, sec)
auth.set_access_token(tok, tsec)
api = tweepy.API(auth)
pub = api.home_timeline()
for i in pub:
        print str(i)
In general, you can use the dir() builtin in Python to inspect an object.
It would seem the Tweepy documentation is very lacking here, but I would imagine the Status objects mirror the structure of Twitter's REST status format, see (for example) https://dev.twitter.com/docs/api/1/get/statuses/home_timeline
So -- try
print dir(status)
to see what lives in the status object
or just, say,
print status.text
print status.user.screen_name
Have a look at the __getstate__() method, which can be used to inspect the returned object:
for i in pub:
    print i.__getstate__()
The api.home_timeline() method returns a list of 20 tweepy.models.Status objects which correspond to the 20 most recent tweets. That is, each tweet is represented as an object of the Status class. Each Status object has a number of attributes like id, text, user, place, created_at, etc.
The following code would print the tweet id and the text :
tweets = api.home_timeline()
for tweet in tweets:
    print tweet.id, " : ", tweet.text
For actual tweets: if you want a specific tweet, you must have its tweet ID, and use
tweets = self.api.statuses_lookup(tweetIDs)
for tweet in tweets:
    # tweet obtained
    print(str(tweet.id) + str(tweet.text))
Or, if you want tweets in general, use the Twitter Stream API:
class StdOutListener(StreamListener):
    def __init__(self, outputDatabaseName, collectionName):
        try:
            print("Connecting to database")
            conn = pymongo.MongoClient()
            outputDB = conn[outputDatabaseName]
            self.collection = outputDB[collectionName]
            self.counter = 0
        except pymongo.errors.ConnectionFailure as e:
            print("Could not connect to MongoDB:")

    def on_data(self, data):
        datajson = json.loads(data)
        if "lang" in datajson and datajson["lang"] == "en" and "text" in datajson:
            self.collection.insert(datajson)
            text = datajson["text"].encode("utf-8")  # The text of the tweet
            self.counter += 1
            print(str(self.counter) + " " + str(text))

    def on_error(self, status):
        print("ERROR")
        print(status)

    def on_connect(self):
        print("You're connected to the streaming server.")

l = StdOutListener(dbname, cname)
auth = OAuthHandler(Auth.consumer_key, Auth.consumer_secret)
auth.set_access_token(Auth.access_token, Auth.access_token_secret)
stream = Stream(auth, l)
stream.filter(track=stopWords)
Create a class StdOutListener which inherits from StreamListener and override the on_data function; the tweet is returned in JSON format, and this function runs every time a tweet is obtained.
Tweets are filtered according to stopWords, which is a list of the words you want in your tweets.
On a tweepy Status instance you can access the _json attribute, which returns a dict representing the original Tweet contents.
For example:
type(status)
# tweepy.models.Status
type(status._json)
# dict
status._json.keys()
# dict_keys(['favorite_count', 'contributors', 'id', 'user', ...])
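For instance, if you want to keep the raw tweet, the dict can be serialized directly (a small sketch; the filename is arbitrary):
import json

with open('tweet.json', 'w') as f:
    json.dump(status._json, f, indent=2)  # _json is a plain dict, so it serializes as-is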
