I am using tweepy and a geocoder (geopy's Nominatim) to convert ZIP codes to latitude and longitude and then pull tweets from the Twitter API with tweepy, but nothing is returned. I have stepped through my code line by line and get stuck at api.search, which returns nothing every time.
# imports inferred from the aliases used later in the snippet
import time as t
import pandas as pd
import tweepy as tp
from geopy.geocoders import Nominatim

query = 'stack'
radius = 1000
DataSet = pd.DataFrame()
loopCount = 0
appended_data = []
appendData = []
def toDataFrame(tweets):
    DataSet = pd.DataFrame()
    DataSet['tweetID'] = [tweet.id for tweet in tweets]
    DataSet['tweetText'] = [tweet.text for tweet in tweets]
    DataSet['tweetRetweetCt'] = [tweet.retweet_count for tweet in tweets]
    DataSet['tweetFavoriteCt'] = [tweet.favorite_count for tweet in tweets]
    DataSet['tweetSource'] = [tweet.source for tweet in tweets]
    DataSet['tweetCreated'] = [tweet.created_at for tweet in tweets]
    DataSet['userID'] = [tweet.user.id for tweet in tweets]
    DataSet['userScreen'] = [tweet.user.screen_name for tweet in tweets]
    DataSet['userName'] = [tweet.user.name for tweet in tweets]
    DataSet['userCreateDt'] = [tweet.user.created_at for tweet in tweets]
    DataSet['userDesc'] = [tweet.user.description for tweet in tweets]
    DataSet['userFollowerCt'] = [tweet.user.followers_count for tweet in tweets]
    DataSet['userFriendsCt'] = [tweet.user.friends_count for tweet in tweets]
    DataSet['userLocation'] = [tweet.user.location for tweet in tweets]
    DataSet['userTimezone'] = [tweet.user.time_zone for tweet in tweets]
    return DataSet
def location(zip):
    geolocator = Nominatim()
    location = geolocator.geocode(zip)
    cordinates = ((location.latitude, location.longitude))
    cordinates = str(cordinates)
    cordinates = cordinates.replace("(", "")
    cordinates = cordinates.replace(")", "")
    return cordinates
def lookUp(results):
    for result in results:
        DataSet = pd.DataFrame(results)
    print DataSet
    return DataSet
##hidden for SO
auth = tp.OAuthHandler('','')
auth.set_access_token('', '')
api = tp.API(auth)
for zip in zips:
    #for row, zip in zips.iterrows():
    if (loopCount == 15):
        t.sleep(960)
        loopCount = 0
    loopCount = loopCount + 1
    cordinates = location(zip)
    inputCode = cordinates + ', ' + str(radius)
    results = api.search(geocode=inputCode, count=100, q=query)
    DataSet = lookUp(results)
    appendData.append(DataSet)
appended_data = pd.concat(appendData, axis=1)
Be careful not to pass spaces in the geocode string, and also add the units. For example, using your location function:
In [5]:
zip = 28039
cordinates = location(zip)
In [23]:
radius = '1km'
inputCode = cordinates + ', ' + str(radius)
inputCode = inputCode.replace(' ', '')
inputCode
Out[23]:
'40.4604043354592,-3.70401484102134,1km'
In [24]:
query = 'a'
results = api.search(geocode=inputCode, count=100, q=query)
In [25]:
len(results)
Out[25]:
100
For reference, from the Twitter docs:
The parameter value is specified by “latitude,longitude,radius”, where
radius units must be specified as either “mi” (miles) or “km”
(kilometers).
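If you do this often, the whole thing can be folded into a small helper. This is only a sketch, not code from the question; the helper name build_geocode and the hard-coded "km" unit are assumptions:
def build_geocode(zip_code, radius_km):
    # geocode the ZIP code and return "lat,long,<radius>km" with no spaces,
    # which is the format the geocode parameter of api.search expects
    geolocator = Nominatim()
    loc = geolocator.geocode(zip_code)
    return '{},{},{}km'.format(loc.latitude, loc.longitude, radius_km)

results = api.search(geocode=build_geocode(28039, 1), count=100, q=query)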
Hope it helps.
Related
I am trying to apply TF-IDF to my data (using the code by Dr. W.J.B. Mattingly: https://github.com/wjbmattingly/topic_modeling_textbook/blob/main/lessons/02_tf_idf_official.py) - descriptions of startups from the StartupBlink website.
I can't work out how to properly extract the individual words: at the moment the output is a single string with all the words run together, like this (you will also notice lots of empty lists inside as well):
[['qualitygeotechnicalinvestigationtestinggeotechnicalreportspreconditiondevelopmentideasnewprojectimplementationintensivefieldlaboratorytestingsnecessaryobtaininputdatasoillayerscapacitysettlementcategorizationqualitymaterials']
# imports implied by the code below
import string
import numpy
import pandas as pd
import requests
from tqdm import tqdm
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

s = requests.Session()
df = pd.DataFrame()
for p in tqdm(range(2000)):
    r = s.get(f'https://www.startupblink.com/api/entities?entity=startups&page={p}')
    d = pd.json_normalize(r.json()['page'])
    df = pd.concat([df, d], axis=0, ignore_index=True)
df.to_csv('World_startups.csv')
# selecting only ESG related startups
esg = df[df['subindustry_name'].isin(['Energy', 'Energy & Environment-Other', 'Smart Cities', 'Smart Home', 'Public Transportation', 'Sustainability',
'Transportation-Other','Waste Management'])]
esg = esg[['title', 'description', 'subindustry_name']]
description = esg.description.tolist()
#description = description.remove(np.nan)
def remove_stopwords(text, stops):
    words = text.split()
    final = []
    for word in words:
        if word not in stops:
            final.append(word)
    final = "".join(final)
    final = final.translate(str.maketrans("", "", string.punctuation))
    final = "".join([i for i in final if not i.isdigit()])
    while "  " in final:
        final = final.replace("  ", " ")
    return final
def clean_docs(docs):
    stops = stopwords.words('english')
    final = []
    for doc in docs:
        clean_doc = remove_stopwords(doc, stops)
        final.append(clean_doc)
    return (final)
cleaned_docs = clean_docs(description)
vectorizer = TfidfVectorizer(lowercase=True,
                             max_features=100,
                             # max_df=.9,  # percentage
                             # min_df=2,   # number of docs
                             ngram_range=(1, 3),  # up to trigrams
                             stop_words='english')
vectors = vectorizer.fit_transform(cleaned_docs)
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
# Printing all unique dense values to mid-check
densearray = numpy.array(denselist)
print(numpy.unique(densearray))
all_keywords = []
for d in denselist:
    x = 0
    keywords = []
    for word in d:
        if word > 0:
            keywords.append(feature_names[x])
        x = x + 1
    all_keywords.append(keywords)
all_keywords[7]
print(len(all_keywords))
# the list contains lots of empty lists inside - will remove them
all_keywords = [ele for ele in all_keywords if ele != []]
print('')
print(len(all_keywords))
print(all_keywords[7])
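One thing worth checking (an educated guess, since the run-together output above is exactly what an empty join separator produces) is the join step in remove_stopwords; a minimal sketch of that function with a single-space separator, under that assumption:
def remove_stopwords(text, stops):
    words = text.split()
    final = [word for word in words if word not in stops]
    final = " ".join(final)  # join with a space so the words stay separated
    final = final.translate(str.maketrans("", "", string.punctuation))
    final = "".join([i for i in final if not i.isdigit()])
    while "  " in final:
        final = final.replace("  ", " ")
    return final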
Working on a project to scrape the Billboard Top 100 over multiple weeks, look up song audio features using Spotify's API, and save the info in a new pandas df.
I got this to work for up to 100 searches at a time (the Spotify API only allows 100 IDs per request), but I am having trouble writing code that iterates through the song IDs 100 at a time, runs the API calls, and saves the results into a new df.
Below is the working code for 100 ID searches at a time:
df_import = pd.read_csv(r'xxx/Billboard_Top_100.csv')
track_id_list = []
artist_name_list = []
track_name_list = []
for item, row in df_import.head(100).iterrows():
    artist = row['Artist']
    track = row['Song']
    try:
        spotify_response = sp.search(q='artist:' + artist + ' track:' + track, type='track')
        #artist name
        artist_name = spotify_response['tracks']['items'][0]['artists'][0]['name']
        #song name
        track_name = spotify_response['tracks']['items'][0]['name']
        #unique spotify track id used for audio feature search
        track_id = spotify_response['tracks']['items'][0]['uri']
        #splits string to search for features
        track_id_split = str.split(track_id, 'spotify:track:')
        track_id_list.append(track_id_split[1])
        artist_name_list.append(row['Artist'])
        track_name_list.append(row['Song'])
    except:
        DNF_song_search = sp.search(q=track)
        artist_name = DNF_song_search['tracks']['items'][0]['artists'][0]['name']
        if search(artist_name, artist):
            #song name
            track_name = DNF_song_search['tracks']['items'][0]['name']
            #unique spotify track id used for audio feature search
            track_id = DNF_song_search['tracks']['items'][0]['uri']
            #splits string to search for features
            track_id_split = str.split(track_id, 'spotify:track:')
            track_id_list.append(track_id_split[1])
            artist_name_list.append(row['Artist'])
            track_name_list.append(row['Song'])
        else:
            print('Inconsistent artist match on: ' + artist + ' ' + artist_name + ' for song ' + track)
#spotify api to save song features based on track ids
features = sp.audio_features(track_id_list)
#save features list into pandas df
features_df = pd.DataFrame(data = features)
#add artist and song columns from imported billboard df
features_df['Artist'] = artist_name_list
features_df['Song'] = track_name_list
#combine the two dataframes
df_merged = pd.merge(df_import, features_df, on = 'Song', how = 'left')
df_merged.to_csv('merged.csv')
I have tried saving all of the song IDs into a list and then executing the API 100 IDs at a time, but I get various errors when I try to save the results into a new dataframe.
Solved it myself:
track_id_list = []
artist_name_list = []
track_name_list = []
for n in range(len(df_import) // 100):
    for r in range(99):
        artist = df_import.iloc[r+(n*100), 3]
        track = df_import.iloc[r+(n*100), 4]
        try:
            spotify_response = sp.search(q='artist:' + artist + ' track:' + track, type='track')
            artist_name = spotify_response['tracks']['items'][0]['artists'][0]['name']
            track_name = spotify_response['tracks']['items'][0]['name']
            #unique spotify track id used for audio feature search
            track_id = spotify_response['tracks']['items'][0]['uri']
            #splits string to search for features
            track_id_split = str.split(track_id, 'spotify:track:')
            track_id_list.append(track_id_split[1])
            artist_name_list.append(artist)
            track_name_list.append(track)
        except:
            DNF_song_search = sp.search(q=track)
            artist_name = DNF_song_search['tracks']['items'][0]['artists'][0]['name']
            if search(artist_name, artist):
                track_name = DNF_song_search['tracks']['items'][0]['name']
                track_id = DNF_song_search['tracks']['items'][0]['uri']
                track_id_split = str.split(track_id, 'spotify:track:')
                track_id_list.append(track_id_split[1])
                artist_name_list.append(artist)
                track_name_list.append(track)
            else:
                print('Inconsistent artist match on: ' + artist + ' ' + artist_name + ' for song ' + track)
features_df = pd.DataFrame()
for num in range(len(track_id_list) // 100 + 1):
    features = sp.audio_features(track_id_list[(num*100):(num+1)*100])
    features_df = features_df.append(pd.DataFrame(features))
#add artist and song columns from imported billboard df
features_df['Artist'] = artist_name_list
features_df['Song'] = track_name_list
#combine the two dataframes
df_merged = pd.merge(df_import, features_df.drop_duplicates(), on = 'Song', how = 'left')
df_merged.to_csv('mergedv2.csv')
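For anyone reusing this, the batching can also be written by slicing the ID list directly, which avoids the nested index arithmetic and also picks up the final partial block of IDs. A sketch only, assuming sp is the authenticated spotipy client and track_id_list has already been filled as above; chunks is a hypothetical helper name:
def chunks(seq, size=100):
    # yield consecutive slices of at most `size` items
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

feature_frames = []
for batch in chunks(track_id_list, 100):
    # sp.audio_features accepts up to 100 track IDs per call;
    # tracks with no analysis come back as None, so drop those
    feats = [f for f in sp.audio_features(batch) if f is not None]
    feature_frames.append(pd.DataFrame(feats))
features_df = pd.concat(feature_frames, ignore_index=True)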
I am trying to extract audio features from Spotify using track URIs. I have a list of 500k and would like to extract audio features for all of them. I have working code below that can extract features for 80 songs, and I need some help modifying it to extract 80 at a time so I don't run afoul of the Spotify limit. An example of the list is below:
['spotify:track:2d7LPtieXdIYzf7yHPooWd',
'spotify:track:0y4TKcc7p2H6P0GJlt01EI',
'spotify:track:6q4c1vPRZREh7nw3wG7Ixz',
'spotify:track:54KFQB6N4pn926IUUYZGzK',
'spotify:track:0NeJjNlprGfZpeX2LQuN6c']
client_id = 'xxx'
client_secret = 'xxx'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
def get_audio_features(saved_uris):
    artist = []
    track = []
    danceability = []
    energy = []
    key = []
    loudness = []
    mode = []
    speechiness = []
    acousticness = []
    instrumentalness = []
    liveness = []
    valence = []
    tempo = []
    duration_ms = []
    for uri in saved_uris:
        x = sp.audio_features(uri)
        y = sp.track(uri)
        for audio_features in x:
            danceability.append(audio_features['danceability'])
            energy.append(audio_features['energy'])
            key.append(audio_features['key'])
            loudness.append(audio_features['loudness'])
            mode.append(audio_features['mode'])
            speechiness.append(audio_features['speechiness'])
            acousticness.append(audio_features['acousticness'])
            instrumentalness.append(audio_features['instrumentalness'])
            liveness.append(audio_features['liveness'])
            valence.append(audio_features['valence'])
            tempo.append(audio_features['tempo'])
            duration_ms.append(audio_features['duration_ms'])
        artist.append(y['album']['artists'][0]['name'])
        track.append(y['name'])
    df = pd.DataFrame()
    df['artist'] = artist
    df['track'] = track
    df['danceability'] = danceability
    df['energy'] = energy
    df['key'] = key
    df['loudness'] = loudness
    df['mode'] = mode
    df['speechiness'] = speechiness
    df['acousticness'] = acousticness
    df['instrumentalness'] = instrumentalness
    df['liveness'] = liveness
    df['valence'] = valence
    df['tempo'] = tempo
    df['duration_ms'] = duration_ms
    df.to_csv('data/xxx.csv')
    return df
My output is a dataframe that looks like this (I have cut some columns for readability):
artist            track         danceability  energy  key  loudness
Sleeping At Last  Chasing Cars  0.467         0.157   11
This code will return the dataframe that you require.
import spotipy
import time
import random
from spotipy.oauth2 import SpotifyClientCredentials  # to access authorised Spotify data
import pandas as pd
client_id = 'paste client_id here'
client_secret = 'paste client_secret here'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
sp.trace=False
#your uri list goes here
s_list = ['spotify:track:2d7LPtieXdIYzf7yHPooWd','spotify:track:0y4TKcc7p2H6P0GJlt01EI','spotify:track:6q4c1vPRZREh7nw3wG7Ixz','spotify:track:54KFQB6N4pn926IUUYZGzK','spotify:track:0NeJjNlprGfZpeX2LQuN6c']
#put uri to dataframe
df = pd.DataFrame(s_list)
df.columns = ['URI']
df['energy'] = ''*df.shape[0]
df['loudness'] = ''*df.shape[0]
df['speechiness'] = ''*df.shape[0]
df['valence'] = ''*df.shape[0]
df['liveness'] = ''*df.shape[0]
df['tempo'] = ''*df.shape[0]
df['danceability'] = ''*df.shape[0]
for i in range(0, df.shape[0]):
    time.sleep(random.uniform(3, 6))
    URI = df.URI[i]
    features = sp.audio_features(URI)
    df.loc[i, 'energy'] = features[0]['energy']
    df.loc[i, 'speechiness'] = features[0]['speechiness']
    df.loc[i, 'liveness'] = features[0]['liveness']
    df.loc[i, 'loudness'] = features[0]['loudness']
    df.loc[i, 'danceability'] = features[0]['danceability']
    df.loc[i, 'tempo'] = features[0]['tempo']
    df.loc[i, 'valence'] = features[0]['valence']
uri=0
Output:
Hope this solves your problem.
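Since the original question mentions a list of 500k URIs, the per-track loop above will be slow; sp.audio_features also accepts a list of URIs (up to 100 per call), so the calls can be batched. A sketch only, assuming sp is the client created above and s_list holds the full URI list; the batch size of 80 simply mirrors the number in the question:
batch_size = 80
rows = []
for start in range(0, len(s_list), batch_size):
    batch = s_list[start:start + batch_size]
    for features in sp.audio_features(batch):
        if features is not None:  # tracks with no analysis come back as None
            rows.append(features)
    time.sleep(random.uniform(1, 2))  # stay clear of the rate limit
features_df = pd.DataFrame(rows)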
I am currently making a Twitter scraper and I want to get all tweets with multiple hashtags. The problem is I receive 429 errors every time I try to get past the first hashtag. I've tried sleeping the function, but every time the second hashtag comes around it doesn't work.
import tweepy
import time
import json
from collections import defaultdict as dd
f = open("tokens.txt", 'r')
consumer_key = f.readline().strip()
consumer_secret = f.readline().strip()
app_key = f.readline().strip()
app_secret = f.readline().strip()
auth =tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(app_key,app_secret)
api = tweepy.API(auth,wait_on_rate_limit=True,wait_on_rate_limit_notify=True)
usercount = dd(int)
userfollowers = dd(int)
mostretweets = dd(int)
mostfav = dd(int)
hashtag = ['#csforall','#equality']
for i in hashtag:
    for status in tweepy.Cursor(api.search, q=i, since="2017-02-25", until="2017-02-28", lang="en").items():
        parsed = status._json
        usercount[parsed['user']['name'].encode("utf-8")] += 1
        userfollowers[parsed['user']['name'].encode("utf-8")] = parsed['user']['followers_count']
        mostretweets[parsed['text'].encode('utf-8')] = parsed['retweet_count']
        mostfav[parsed['text'].encode('utf-8')] = parsed['favorite_count']
    time.sleep(2)
sortcount = sorted(usercount.items(), key=lambda x: x[1], reverse =True)
top = sortcount[:1]
frequser=[]
for i in sortcount:
    if i[1] == top:
        frequser.append(i)
    else:
        break
print ("Top most frequent user: \n " + str(i[0])) +"\n"
followcount = sorted(userfollowers.items(), key=lambda x: x[1], reverse =True)
fol = followcount[:1]
freqfollow = []
for j in followcount:
    if j[1] == fol:
        freqfollow.append(i)
    else:
        break
print ("User with most followers: \n " + str(j[1]))
retweetcount = sorted(mostretweets.items(), key=lambda x: x[1], reverse = True)
ret = retweetcount[:1]
freqretweet =[]
for i in retweetcount:
    if i[1] == ret:
        freqretweet == ret
    else:
        break
print str(i[0])+"\n"
favcount = sorted(mostfav.items(), key=lambda x: x[1], reverse = True)
ret = favcount[:1]
freqfav =[]
for i in favcount:
    if i[1] == ret:
        freqfav == ret
    else:
        break
print str(i[0])+"\n"
Does putting this:
for i in hashtag:
    time.sleep(2)
    for status in tweepy.Cursor(api.search, q=i, since="2017-02-25", until="2017-02-28", lang="en").items():
work?
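Another pattern that is often used for 429s (a sketch only, not part of the original answer) is to catch tweepy.RateLimitError around the cursor and sleep out the 15-minute window. Note this assumes tweepy 3.x and an API object created without wait_on_rate_limit=True, because with that flag tweepy waits internally instead of raising:
def limited_search(api, q, **kwargs):
    # iterate a search cursor, sleeping through rate-limit (429) windows
    cursor = tweepy.Cursor(api.search, q=q, **kwargs).items()
    while True:
        try:
            yield next(cursor)
        except tweepy.RateLimitError:
            time.sleep(15 * 60)  # the search rate limit resets every 15 minutes
        except StopIteration:
            return

for i in hashtag:
    for status in limited_search(api, i, since="2017-02-25", until="2017-02-28", lang="en"):
        parsed = status._json
        # ...same counting logic as in the question...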
I'm working on a Sentiment Analysis project using Twitter Data, and I've encountered a small problem regarding Dates. The code itself runs fine, but I don't know how to build custom time blocks for grouping my final data. Right now, it is defaulting to grouping them by the second, which is not very useful. I want to be able to group them in half-hour, hour, and day segments...
Feel free to skip to the bottom of the code to see where the issue lies!
Here is the code:
import tweepy
API_KEY = "XXXXX"
API_SECRET = "XXXXXX"
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
import sklearn as sk
import pandas as pd
import got3
#"Get Old Tweets" to find older data
tweetCriteria = got3.manager.TweetCriteria()
tweetCriteria.setQuerySearch("Kentucky Derby")
tweetCriteria.setSince("2016-05-07")
tweetCriteria.setUntil("2016-05-08")
tweetCriteria.setMaxTweets(1000)
TweetCriteria = got3.manager.TweetCriteria()
KYDerby_tweets = got3.manager.TweetManager.getTweets(tweetCriteria)
from afinn import Afinn
afinn = Afinn()
#getting afinn library to use for sentiment polarity analysis
for x in KYDerby_tweets:
    Text = x.text
    Retweets = x.retweets
    Favorites = x.favorites
    Date = x.date
    Id = x.id
    print(Text)
AllText = []
AllRetweets = []
AllFavorites = []
AllDates = []
AllIDs = []
for x in KYDerby_tweets:
    Text = x.text
    Retweets = x.retweets
    Favorites = x.favorites
    Date = x.date
    AllText.append(Text)
    AllRetweets.append(Retweets)
    AllFavorites.append(Favorites)
    AllDates.append(Date)
    AllIDs.append(Id)
data_set = [[x.id, x.date, x.text, x.retweets, x.favorites]
for x in KYDerby_tweets]
df = pd.DataFrame(data=data_set, columns=["Id", "Date", "Text", "Favorites", "Retweets"])
#I now have a DataFrame with my basic info in it
pscore = []
for x in KYDerby_tweets:
    afinn.score(x.text)
    pscore.append(afinn.score(x.text))
df['P Score'] = pscore
#I now have the pscores for each Tweet in the DataFrame
nrc = pd.read_csv('C:\\users\\andrew.smith\\downloads\\NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt', sep="\t", names=["word", "emotion", "association"], skiprows=45)
#import NRC emotion lexicon
nrc = nrc[nrc["association"]==1]
nrc = nrc[nrc["emotion"].isin(["positive", "negative"]) == False]
#cleaned it up a bit
from nltk import TweetTokenizer
tt = TweetTokenizer()
tokenized = [x.lower() for x in tokenized]
#built my Tweet-specific, NRC-ready tokenizer
emotions = list(set(nrc["emotion"]))
index2emotion = {}
emotion2index = {}
for i in range(len(emotions)):
    index2emotion[i] = emotions[i]
    emotion2index[emotions[i]] = i
cv = [0] * len(emotions)
#built indices showing locations of emotions
for token in tokenized:
    sub = nrc[nrc['word'] == token]
    token_emotions = sub['emotion']
    for e in token_emotions:
        position_index = emotion2index[e]
        cv[position_index] += 1
emotions = list(set(nrc['emotion']))
index2emotion = {}
emotion2index = {}
for i in range(len(emotions)):
    index2emotion[i] = emotions[i]
    emotion2index[emotions[i]] = i
def makeEmoVector(tweettext):
    cv = [0] * len(emotions)
    tokenized = tt.tokenize(tweettext)
    tokenized = [x.lower() for x in tokenized]
    for token in tokenized:
        sub = nrc[nrc['word'] == token]
        token_emotions = sub['emotion']
        for e in token_emotions:
            position_index = emotion2index[e]
            cv[position_index] += 1
    return cv
tweettext = df.iloc[14,:]['Text']
emotion_vectors = []
for text in df['Text']:
    emotion_vector = makeEmoVector(text)
    emotion_vectors.append(emotion_vector)
ev = pd.DataFrame(emotion_vectors, index=df.index, columns=emotions)
#Now I have a DataFrame with all of the emotion counts for each tweet
Date_Group = df.groupby("Date")
Date_Group[emotions].agg("sum")
#Finally, we arrive at the problem! When I run this, I end up with tweets that are grouped by the second. What I want is to be able to group them: a) by the half-hour, b) by the hour, and c) by the day
The default date format for tweets with the Tweepy API is "2017-04-14 18:41:56". To get tweets grouped by hour, you can do something as simple as this:
# This will get the time parameter
time = [item.split(" ")[1] for item in df['date'].values]
# This will get the hour parameter
hour = [item.split(":")[0] for item in time]
df['time'] = hour
grouped_tweets = df[['time', 'number_tweets']].groupby('time')
tweet_growth_hour = grouped_tweets.sum()
tweet_growth_hour['time']= tweet_growth_hour.index
print tweet_growth_hour
To group by date, you can do something similar:
days = [item.split(" ")[0] for item in df['date'].values]
df['days'] = days
grouped_tweets = df[['days', 'number_tweets']].groupby('days')
tweet_growth_days = grouped_tweets.sum()
tweet_growth_days['days']= tweet_growth_days.index
print tweet_growth_days
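A pandas-native alternative (a sketch only; it assumes the Date column parses with pd.to_datetime and that the emotion columns have been joined onto df, e.g. df = df.join(ev), mirroring how the question groups them) is to let pd.Grouper pick the bucket size, which also covers the half-hour case:
df['Date'] = pd.to_datetime(df['Date'])

# half-hour, hour, and day buckets, summing the emotion counts per bucket
half_hourly = df.groupby(pd.Grouper(key='Date', freq='30min'))[emotions].sum()
hourly = df.groupby(pd.Grouper(key='Date', freq='1H'))[emotions].sum()
daily = df.groupby(pd.Grouper(key='Date', freq='1D'))[emotions].sum()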