I am working on a project to scrape the Billboard Top 100 over multiple weeks, look up each song's audio features through Spotify's API, and save the results in a new pandas DataFrame.
I got this to work for up to 100 searches at a time (the Spotify API accepts at most 100 track IDs per audio-features request), but I am having trouble writing code that iterates through the song IDs 100 at a time, calls the API, and saves the results into a new DataFrame.
Below is the working code for 100 ID searches at a time:
df_import = pd.read_csv(r'xxx/Billboard_Top_100.csv')
track_id_list = []
artist_name_list = []
track_name_list = []
for item, row in df_import.head(100).iterrows():
    artist = row['Artist']
    track = row['Song']
    try:
        spotify_response = sp.search(q='artist:' + artist + ' track:' + track, type='track')
        #artist name
        artist_name = spotify_response['tracks']['items'][0]['artists'][0]['name']
        #song name
        track_name = spotify_response['tracks']['items'][0]['name']
        #unique Spotify track id used for audio feature search
        track_id = spotify_response['tracks']['items'][0]['uri']
        #splits string to search for features
        track_id_split = str.split(track_id, 'spotify:track:')
        track_id_list.append(track_id_split[1])
        artist_name_list.append(row['Artist'])
        track_name_list.append(row['Song'])
    except:
        DNF_song_search = sp.search(q=track)
        artist_name = DNF_song_search['tracks']['items'][0]['artists'][0]['name']
        if search(artist_name, artist):
            #song name
            track_name = DNF_song_search['tracks']['items'][0]['name']
            #unique Spotify track id used for audio feature search
            track_id = DNF_song_search['tracks']['items'][0]['uri']
            #splits string to search for features
            track_id_split = str.split(track_id, 'spotify:track:')
            track_id_list.append(track_id_split[1])
            artist_name_list.append(row['Artist'])
            track_name_list.append(row['Song'])
        else:
            print('Inconsistent artist match on: ' + artist + ' ' + artist_name + ' for song ' + track)
#spotify api to save song features based on track ids
features = sp.audio_features(track_id_list)
#save features list into pandas df
features_df = pd.DataFrame(data = features)
#add artist and song columns from imported billboard df
features_df['Artist'] = artist_name_list
features_df['Song'] = track_name_list
#combine the two dataframes
df_merged = pd.merge(df_import, features_df, on = 'Song', how = 'left')
df_merged.to_csv('merged.csv')
I have tried saving all of the song IDs into a list and then calling the API 100 IDs at a time, but I get various errors when I try to save the results into a new DataFrame.
I solved it myself:
track_id_list = []
artist_name_list = []
track_name_list = []
for n in range(len(df_import) // 100):
    #range(100), not range(99), so the last row of each block of 100 is included
    for r in range(100):
        artist = df_import.iloc[r+(n*100),3]
        track = df_import.iloc[r+(n*100),4]
        try:
            spotify_response = sp.search(q='artist:' + artist + ' track:' + track, type='track')
            artist_name = spotify_response['tracks']['items'][0]['artists'][0]['name']
            track_name = spotify_response['tracks']['items'][0]['name']
            #unique spotify track id used for audio feature search
            track_id = spotify_response['tracks']['items'][0]['uri']
            #splits string to search for features
            track_id_split = str.split(track_id, 'spotify:track:')
            track_id_list.append(track_id_split[1])
            artist_name_list.append(artist)
            track_name_list.append(track)
        except:
            DNF_song_search = sp.search(q=track)
            artist_name = DNF_song_search['tracks']['items'][0]['artists'][0]['name']
            if search(artist_name, artist):
                track_name = DNF_song_search['tracks']['items'][0]['name']
                track_id = DNF_song_search['tracks']['items'][0]['uri']
                track_id_split = str.split(track_id, 'spotify:track:')
                track_id_list.append(track_id_split[1])
                artist_name_list.append(artist)
                track_name_list.append(track)
            else:
                print('Inconsistent artist match on: ' + artist + ' ' + artist_name + ' for song ' + track)
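#request audio features 100 ids at a time and collect one frame per batch
feature_frames = []
for start in range(0, len(track_id_list), 100):
    features = sp.audio_features(track_id_list[start:start + 100])
    feature_frames.append(pd.DataFrame(features))
#DataFrame.append was removed in pandas 2.0, so combine with pd.concat
features_df = pd.concat(feature_frames, ignore_index=True)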
#add artist and song columns from imported billboard df
features_df['Artist'] = artist_name_list
features_df['Song'] = track_name_list
#combine the two dataframes
df_merged = pd.merge(df_import, features_df.drop_duplicates(), on = 'Song', how = 'left')
df_merged.to_csv('mergedv2.csv')
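One thing to watch in the final merge: joining on 'Song' alone can attach one artist's features to a different artist's song with the same title. Below is a minimal sketch with toy data showing a merge on both 'Artist' and 'Song'. The `Week` column, the artist names and the tempo values are made up for illustration; only the column names 'Artist' and 'Song' come from the code above.

```python
import pandas as pd

# Toy stand-in for the Billboard chart: the same title appears three times,
# twice for one artist (two chart weeks) and once for a different artist.
df_import = pd.DataFrame({
    "Week": ["2021-01-02", "2021-01-09", "2021-01-09"],
    "Artist": ["Artist A", "Artist A", "Artist B"],
    "Song": ["Hello", "Hello", "Hello"],
})

# Toy stand-in for the Spotify results, one row per chart row that was searched.
features_df = pd.DataFrame({
    "Artist": ["Artist A", "Artist A", "Artist B"],
    "Song": ["Hello", "Hello", "Hello"],
    "tempo": [120.0, 120.0, 95.0],
})

# Keep one feature row per (Artist, Song) pair, then merge on both keys so
# two different songs that share a title are not matched to each other.
unique_features = features_df.drop_duplicates(subset=["Artist", "Song"])
df_merged = pd.merge(df_import, unique_features, on=["Artist", "Song"], how="left")
print(df_merged)
```

With a left merge the chart keeps all of its rows, and each row receives the features of its own artist's version of the song.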
I am trying to apply TF-IDF (using the code by Dr. W.J.B. Mattingly: https://github.com/wjbmattingly/topic_modeling_textbook/blob/main/lessons/02_tf_idf_official.py) to my data: descriptions of startups from the StartupBlink website.
I cannot work out how to extract the words properly. At the moment each cleaned document comes out as a single string with all the words run together, like the one below, and the keyword output also contains lots of empty lists:
[['qualitygeotechnicalinvestigationtestinggeotechnicalreportspreconditiondevelopmentideasnewprojectimplementationintensivefieldlaboratorytestingsnecessaryobtaininputdatasoillayerscapacitysettlementcategorizationqualitymaterials']
s = requests.Session()
df = pd.DataFrame()
for p in tqdm(range(2000)):
    r = s.get(f'https://www.startupblink.com/api/entities?entity=startups&page={p}')
    d = pd.json_normalize(r.json()['page'])
    df = pd.concat([df, d], axis=0, ignore_index=True)
df.to_csv('World_startups.csv')
# selecting only ESG related startups
esg = df[df['subindustry_name'].isin(['Energy', 'Energy & Environment-Other', 'Smart Cities', 'Smart Home', 'Public Transportation', 'Sustainability',
'Transportation-Other','Waste Management'])]
esg = esg[['title', 'description', 'subindustry_name']]
description = esg.description.tolist()
#description = description.remove(np.nan)
def remove_stopwords(text, stops):
    words = text.split()
    final = []
    for word in words:
        if word not in stops:
            final.append(word)
    final = "".join(final)
    final = final.translate(str.maketrans("", "", string.punctuation))
    final = "".join([i for i in final if not i.isdigit()])
    while "  " in final:
        final = final.replace("  ", " ")
    return final
def clean_docs(docs):
    stops = stopwords.words('english')
    final = []
    for doc in docs:
        clean_doc = remove_stopwords(doc, stops)
        final.append(clean_doc)
    return (final)
cleaned_docs = clean_docs(description)
vectorizer = TfidfVectorizer(lowercase=True,
max_features=100,
# max_df=.9,# percentage
# min_df=2, # number of
ngram_range=(1,3),
stop_words = 'english') # up to triagrams
vectors = vectorizer.fit_transform(cleaned_docs)
feature_names = vectorizer.get_feature_names_out()
dense = vectors.todense()
denselist = dense.tolist()
# Printing all unique dense values to mid-check
densearray = numpy.array(denselist)
print(numpy.unique(densearray))
all_keywords = []
for d in denselist:
    x=0
    keywords = []
    for word in d:
        if word > 0:
            keywords.append(feature_names[x])
        x=x+1
    all_keywords.append(keywords)
all_keywords[7]
print(len(all_keywords))
# the list contains lots of empty lists - will remove them
all_keywords = [ele for ele in all_keywords if ele != []]
print('')
print(len(all_keywords))
print(all_keywords[7])
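Judging only from the code above, a likely cause of the run-together words is the `"".join(final)` in `remove_stopwords`, which glues the kept words back together with no separator. The sketch below joins with a space instead and skips non-string entries such as NaN. It is an illustration, not a drop-in replacement: the two-argument `clean_docs`, the tiny stopword set and the sample document are assumptions made so the example runs on its own.

```python
import string


def remove_stopwords(text, stops):
    # Keep the non-stopwords and re-join them WITH a space between words.
    kept = [word for word in text.split() if word not in stops]
    cleaned = " ".join(kept)
    # Strip punctuation and digits, as in the original function.
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    cleaned = "".join(ch for ch in cleaned if not ch.isdigit())
    # Collapse any runs of whitespace left behind by the removals.
    return " ".join(cleaned.split())


def clean_docs(docs, stops):
    # Skip NaN and other non-string descriptions instead of failing on them.
    return [remove_stopwords(doc, stops) for doc in docs if isinstance(doc, str)]


stops = {"the", "of", "and", "for"}  # illustrative list, not NLTK's
docs = ["Quality geotechnical investigation and testing, 24 reports", float("nan")]
cleaned_docs = clean_docs(docs, stops)
print(cleaned_docs)
```

The empty keyword lists are probably a related symptom: a document whose TF-IDF row is all zeros contains none of the `max_features=100` terms, which is expected when every document is a single fused token. Once the words are separated, fewer rows should come out empty.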
I am trying to extract audio features from Spotify using track URIs. I have a list of 500k URIs and would like to extract audio features for all of them. The working code below can extract the features of 80 songs. I need some help modifying it to process 80 URIs at a time so I don't run afoul of the Spotify limit. An example of the list is below:
['spotify:track:2d7LPtieXdIYzf7yHPooWd',
'spotify:track:0y4TKcc7p2H6P0GJlt01EI',
'spotify:track:6q4c1vPRZREh7nw3wG7Ixz',
'spotify:track:54KFQB6N4pn926IUUYZGzK',
'spotify:track:0NeJjNlprGfZpeX2LQuN6c']
client_id = 'xxx'
client_secret = 'xxx'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
def get_audio_features(saved_uris):
    artist = []
    track = []
    danceability = []
    energy = []
    key = []
    loudness = []
    mode = []
    speechiness = []
    acousticness = []
    instrumentalness = []
    liveness = []
    valence = []
    tempo = []
    duration_ms = []
    for uri in saved_uris:
        x = sp.audio_features(uri)
        y = sp.track(uri)
        for audio_features in x:
            danceability.append(audio_features['danceability'])
            energy.append(audio_features['energy'])
            key.append(audio_features['key'])
            loudness.append(audio_features['loudness'])
            mode.append(audio_features['mode'])
            speechiness.append(audio_features['speechiness'])
            acousticness.append(audio_features['acousticness'])
            instrumentalness.append(audio_features['instrumentalness'])
            liveness.append(audio_features['liveness'])
            valence.append(audio_features['valence'])
            tempo.append(audio_features['tempo'])
            duration_ms.append(audio_features['duration_ms'])
        artist.append(y['album']['artists'][0]['name'])
        track.append(y['name'])
    df = pd.DataFrame()
    df['artist'] = artist
    df['track'] = track
    df['danceability'] = danceability
    df['energy'] = energy
    df['key'] = key
    df['loudness'] = loudness
    df['mode'] = mode
    df['speechiness'] = speechiness
    df['acousticness'] = acousticness
    df['instrumentalness'] = instrumentalness
    df['liveness'] = liveness
    df['valence'] = valence
    df['tempo'] = tempo
    df['duration_ms'] = duration_ms
    df.to_csv('data/xxx.csv')
    return df
My output is a DataFrame that looks like this (I have cut some columns for readability):
artist track danceability energy key loudness
Sleeping At Last Chasing Cars 0.467 0.157 11
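One way to work through the full list is to slice it into batches and send each batch in a single request. The sketch below shows only the batching logic. `chunked`, `collect_features` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` stands in for the real API call so the example runs without credentials.

```python
import time


def chunked(items, size):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_features(uris, fetch_batch, batch_size=80, pause_seconds=0.0):
    # Call `fetch_batch` once per batch and gather the feature dicts.
    rows = []
    for batch in chunked(uris, batch_size):
        for features in fetch_batch(batch):
            # Skip entries for which no features came back.
            if features is not None:
                rows.append(features)
        time.sleep(pause_seconds)  # optional pause between requests
    return rows


def fake_fetch(batch):
    # Hypothetical stand-in for the API call; returns one dict per URI.
    return [{"uri": uri, "tempo": 120.0} for uri in batch]


uris = ["spotify:track:%04d" % n for n in range(205)]
rows = collect_features(uris, fake_fetch, batch_size=80)
print(len(rows))
```

With spotipy, passing `sp.audio_features` as `fetch_batch` should work, on the assumption that it accepts a list of URIs and returns one entry per track. `pd.DataFrame(rows)` then builds the table in one step instead of filling a list per column.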
This code returns the DataFrame you need.
import random
import spotipy
import time
from spotipy.oauth2 import SpotifyClientCredentials #To access authorised Spotify data
import pandas as pd
client_id = 'paste client_id here'
client_secret = 'paste client_secret here'
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
sp.trace=False
#your uri list goes here
s_list = ['spotify:track:2d7LPtieXdIYzf7yHPooWd','spotify:track:0y4TKcc7p2H6P0GJlt01EI','spotify:track:6q4c1vPRZREh7nw3wG7Ixz','spotify:track:54KFQB6N4pn926IUUYZGzK','spotify:track:0NeJjNlprGfZpeX2LQuN6c']
#put uri to dataframe
df = pd.DataFrame(s_list)
df.columns = ['URI']
df['energy'] = ''*df.shape[0]
df['loudness'] = ''*df.shape[0]
df['speechiness'] = ''*df.shape[0]
df['valence'] = ''*df.shape[0]
df['liveness'] = ''*df.shape[0]
df['tempo'] = ''*df.shape[0]
df['danceability'] = ''*df.shape[0]
for i in range(0,df.shape[0]):
    #pause for a few seconds between requests
    time.sleep(random.uniform(3, 6))
    URI = df.URI[i]
    features = sp.audio_features(URI)
    df.loc[i,'energy'] = features[0]['energy']
    df.loc[i,'speechiness'] = features[0]['speechiness']
    df.loc[i,'liveness'] = features[0]['liveness']
    df.loc[i,'loudness'] = features[0]['loudness']
    df.loc[i,'danceability'] = features[0]['danceability']
    df.loc[i,'tempo'] = features[0]['tempo']
    df.loc[i,'valence'] = features[0]['valence']
Hope this solves your problem.
I'm working on a Sentiment Analysis project using Twitter Data, and I've encountered a small problem regarding Dates. The code itself runs fine, but I don't know how to build custom time blocks for grouping my final data. Right now, it is defaulting to grouping them by the second, which is not very useful. I want to be able to group them in half-hour, hour, and day segments...
Feel free to skip to the bottom of the code to see where the issue lies!
Here is the code:
import tweepy
API_KEY = "XXXXX"
API_SECRET = "XXXXXX"
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True)
import sklearn as sk
import pandas as pd
import got3
#"Get Old Tweets" to find older data
tweetCriteria = got3.manager.TweetCriteria()
tweetCriteria.setQuerySearch("Kentucky Derby")
tweetCriteria.setSince("2016-05-07")
tweetCriteria.setUntil("2016-05-08")
tweetCriteria.setMaxTweets(1000)
TweetCriteria = got3.manager.TweetCriteria()
KYDerby_tweets = got3.manager.TweetManager.getTweets(tweetCriteria)
from afinn import Afinn
afinn = Afinn()
#getting afinn library to use for sentiment polarity analysis
for x in KYDerby_tweets:
    Text = x.text
    Retweets = x.retweets
    Favorites = x.favorites
    Date = x.date
    Id = x.id
    print(Text)
AllText = []
AllRetweets = []
AllFavorites = []
AllDates = []
AllIDs = []
for x in KYDerby_tweets:
    Text = x.text
    Retweets = x.retweets
    Favorites = x.favorites
    Date = x.date
    Id = x.id
    AllText.append(Text)
    AllRetweets.append(Retweets)
    AllFavorites.append(Favorites)
    AllDates.append(Date)
    AllIDs.append(Id)
data_set = [[x.id, x.date, x.text, x.retweets, x.favorites]
            for x in KYDerby_tweets]
#column names follow the order of the values in data_set
df = pd.DataFrame(data=data_set, columns=["Id", "Date", "Text", "Retweets", "Favorites"])
#I now have a DataFrame with my basic info in it
pscore = []
for x in KYDerby_tweets:
    afinn.score(x.text)
    pscore.append(afinn.score(x.text))
df['P Score'] = pscore
#I now have the pscores for each Tweet in the DataFrame
nrc = pd.read_csv('C:\\users\\andrew.smith\\downloads\\NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt', sep="\t", names=["word", "emotion", "association"], skiprows=45)
#import NRC emotion lexicon
nrc = nrc[nrc["association"]==1]
nrc = nrc[nrc["emotion"].isin(["positive", "negative"]) == False]
#cleaned it up a bit
from nltk import TweetTokenizer
tt = TweetTokenizer()
tokenized = [x.lower() for x in tokenized]
#built my Tweet-specific, NRC-ready tokenizer
emotions = list(set(nrc["emotion"]))
index2emotion = {}
emotion2index = {}
for i in range(len(emotions)):
    index2emotion[i] = emotions[i]
    emotion2index[emotions[i]] = i
cv = [0] * len(emotions)
#built indices showing locations of emotions
for token in tokenized:
    sub = nrc[nrc['word'] == token]
    token_emotions = sub['emotion']
    for e in token_emotions:
        position_index = emotion2index[e]
        cv[position_index]+=1
emotions = list(set(nrc['emotion']))
index2emotion = {}
emotion2index = {}
for i in range(len(emotions)):
    index2emotion[i] = emotions[i]
    emotion2index[emotions[i]] = i
def makeEmoVector(tweettext):
    cv = [0] * len(emotions)
    tokenized = tt.tokenize(tweettext)
    tokenized = [x.lower() for x in tokenized]
    for token in tokenized:
        sub = nrc[nrc['word'] == token]
        token_emotions = sub['emotion']
        for e in token_emotions:
            position_index = emotion2index[e]
            cv[position_index] += 1
    return cv
tweettext = df.iloc[14,:]['Text']
emotion_vectors = []
for text in df['Text']:
    emotion_vector = makeEmoVector(text)
    emotion_vectors.append(emotion_vector)
ev = pd.DataFrame(emotion_vectors, index=df.index, columns=emotions)
#Now I have a DataFrame with all of the emotion counts for each tweet
Date_Group = df.groupby("Date")
Date_Group[emotions].agg("sum")
#Finally, we arrive at the problem! When I run this, I end up with tweets that are grouped by the second. What I want is to be able to group them: a) by the half-hour, b) by the hour, and c) by the day
The default date format for tweets from the Tweepy API is "2017-04-14 18:41:56", so to group tweets by hour you can do something as simple as this:
# This will get the time parameter
time = [item.split(" ")[1] for item in df['date'].values]
# This will get the hour parameter
hour = [item.split(":")[0] for item in time]
df['time'] = hour
grouped_tweets = df[['time', 'number_tweets']].groupby('time')
tweet_growth_hour = grouped_tweets.sum()
tweet_growth_hour['time']= tweet_growth_hour.index
print(tweet_growth_hour)
To group by date, you can do something similar:
days = [item.split(" ")[0] for item in df['date'].values]
df['days'] = days
grouped_tweets = df[['days', 'number_tweets']].groupby('days')
tweet_growth_days = grouped_tweets.sum()
tweet_growth_days['days']= tweet_growth_days.index
print(tweet_growth_days)
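If the date column holds real datetimes rather than strings, pandas can build the time blocks directly, including half-hour ones, which string splitting cannot express. The sketch below assumes the emotion counts sit in the same frame as a datetime `Date` column (for example after joining the two frames). The helper `sum_by_block`, the timestamps and the counts are invented for illustration.

```python
import pandas as pd

# Toy frame: one row per tweet, with a datetime column and two emotion counts.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-05-07 18:05:10",
        "2016-05-07 18:20:45",
        "2016-05-07 18:41:56",
        "2016-05-07 19:10:00",
        "2016-05-08 09:00:00",
    ]),
    "joy": [1, 0, 2, 1, 3],
    "anger": [0, 1, 0, 0, 1],
})


def sum_by_block(frame, freq, date_col="Date"):
    # Group rows into fixed-width time bins and sum the counts in each bin.
    return frame.groupby(pd.Grouper(key=date_col, freq=freq)).sum()


by_half_hour = sum_by_block(df, "30min")
by_hour = sum_by_block(df, "60min")
by_day = sum_by_block(df, "D")
print(by_hour)
```

Bins with no tweets appear with a sum of 0, because the grouper covers the whole span between the first and last timestamp. If the column holds strings, converting it first with `pd.to_datetime` should make the same call work.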
Sorry for the unsophisticated question title, but I desperately need help.
My objective at work is to create a script that pulls all the records from the ExactTarget (Salesforce Marketing Cloud) API. I have successfully set up the API calls and imported the data into DataFrames.
The problem is two-fold: I need to keep pulling records until "Results_Message" in my code stops reading "MoreDataAvailable", and I need logic that lets me control the date range, either within the API call or by filtering the DataFrame.
My code gets stuck at line 44, where printing Results_Message loops forever on the string "MoreDataAvailable".
Here is my code so far. On lines 94 and 95 you will see my attempt at filtering by date directly in the DataFrame, which did not work, and neither did specifying the date on line 32:
import ET_Client
import pandas as pd
AggreateDF = pd.DataFrame()
Data_Aggregator = pd.DataFrame()
#Start_Date = "2016-02-20"
#End_Date = "2016-02-25"
#retrieveDate = '2016-07-25T13:00:00.000'
Export_Dir = 'C:/temp/'
try:
    debug = False
    stubObj = ET_Client.ET_Client(False, debug)
    print('>>>BounceEvents')
    getBounceEvent = ET_Client.ET_BounceEvent()
    getBounceEvent.auth_stub = stubObj
    getBounceEvent.search_filter = {'Property' : 'EventDate','SimpleOperator' : 'greaterThan','Value' : '2016-02-22T13:00:00.000'}
    getResponse1 = getBounceEvent.get()
    ResponseResultsBounces = getResponse1.results
    Results_Message = getResponse1.message
    print(Results_Message)
    #EventDate = "2016-05-09"
    print("This is original " + str(Results_Message))
    #print ResponseResultsBounces
    i = 1
    while (Results_Message == 'MoreDataAvailable'):
        #if i > 5: break
        print(Results_Message)
        results1 = getResponse1.results
        #print(results1)
        i = i + 1
        ClientIDBounces = []
        partner_keys1 = []
        created_dates1 = []
        modified_date1 = []
        ID1 = []
        ObjectID1 = []
        SendID1 = []
        SubscriberKey1 = []
        EventDate1 = []
        EventType1 = []
        TriggeredSendDefinitionObjectID1 = []
        BatchID1 = []
        SMTPCode = []
        BounceCategory = []
        SMTPReason = []
        BounceType = []
        for BounceEvent in ResponseResultsBounces:
            ClientIDBounces.append(str(BounceEvent['Client']['ID']))
            partner_keys1.append(BounceEvent['PartnerKey'])
            created_dates1.append(BounceEvent['CreatedDate'])
            modified_date1.append(BounceEvent['ModifiedDate'])
            ID1.append(BounceEvent['ID'])
            ObjectID1.append(BounceEvent['ObjectID'])
            SendID1.append(BounceEvent['SendID'])
            SubscriberKey1.append(BounceEvent['SubscriberKey'])
            EventDate1.append(BounceEvent['EventDate'])
            EventType1.append(BounceEvent['EventType'])
            TriggeredSendDefinitionObjectID1.append(BounceEvent['TriggeredSendDefinitionObjectID'])
            BatchID1.append(BounceEvent['BatchID'])
            SMTPCode.append(BounceEvent['SMTPCode'])
            BounceCategory.append(BounceEvent['BounceCategory'])
            SMTPReason.append(BounceEvent['SMTPReason'])
            BounceType.append(BounceEvent['BounceType'])
        df1 = pd.DataFrame({'ClientID': ClientIDBounces, 'PartnerKey': partner_keys1,
                            'CreatedDate' : created_dates1, 'ModifiedDate': modified_date1,
                            'ID':ID1, 'ObjectID': ObjectID1,'SendID':SendID1,'SubscriberKey':SubscriberKey1,
                            'EventDate':EventDate1,'EventType':EventType1,'TriggeredSendDefinitionObjectID':TriggeredSendDefinitionObjectID1,
                            'BatchID':BatchID1,'SMTPCode':SMTPCode,'BounceCategory':BounceCategory,'SMTPReason':SMTPReason,'BounceType':BounceType})
        #print df1
        #df1 = df1[(df1.EventDate > "2016-02-20") & (df1.EventDate < "2016-02-25")]
        #AggreateDF = AggreateDF[(AggreateDF.EventDate > Start_Date) and (AggreateDF.EventDate < End_Date)]
        print(df1['ID'].max())
        AggreateDF = AggreateDF.append(df1)
        print(AggreateDF.shape)
    #df1 = df1[(df1.EventDate > "2016-02-20") and (df1.EventDate < "2016-03-25")]
    #AggreateDF = AggreateDF[(AggreateDF.EventDate > Start_Date) and (AggreateDF.EventDate < End_Date)]
    print("Final Aggregate DF is: " + str(AggreateDF.shape))
    #EXPORT TO CSV
    AggreateDF.to_csv(Export_Dir +'DataTest1.csv')
    #with pd.option_context('display.max_rows',10000):
    #print (df_masked1.shape)
    #print df_masked1
except Exception as e:
    print('Caught exception: ' + str(e))
    print(e)
Before my code parses the data, the original format of the data is a SOAP response, which looks like the excerpt below. Is it possible to filter records by EventDate directly from the SOAP response?
}, (BounceEvent){
Client =
(ClientID){
ID = 1111111
}
PartnerKey = None
CreatedDate = 2016-05-12 07:32:20.000937
ModifiedDate = 2016-05-12 07:32:20.000937
ID = 1111111
ObjectID = "1111111"
SendID = 1111111
SubscriberKey = "aaa#aaaa.com"
EventDate = 2016-05-12 07:32:20.000937
EventType = "HardBounce"
TriggeredSendDefinitionObjectID = "aa111aaa"
BatchID = 1111111
SMTPCode = "1111111"
BounceCategory = "Hard bounce - User Unknown"
SMTPReason = "aaaa"
BounceType = "immediate"
Hope this makes sense; this is my desperate plea for help.
Thank you in advance!
You don't seem to be updating Results_Message in your loop, so it's always going to have the value it gets in line 29: Results_Message = getResponse1.message. Unless there's code involved that you didn't share, that is.
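To make that concrete, here is a minimal sketch of a paging loop that re-reads the status message after every request. It runs against stub classes, not the real client. `FakeResponse`, `FakeBounceEvent` and `fetch_all` are hypothetical, and the follow-up call is named `getMoreResults()` on the assumption that the SDK offers such a method; check the SDK documentation for the actual name.

```python
class FakeResponse:
    """Stand-in for the object returned by each request."""

    def __init__(self, results, message):
        self.results = results
        self.message = message


class FakeBounceEvent:
    """Hypothetical stub that serves three pages, mimicking a paged API."""

    def __init__(self):
        self._pages = [
            FakeResponse([{"ID": 1}, {"ID": 2}], "MoreDataAvailable"),
            FakeResponse([{"ID": 3}], "MoreDataAvailable"),
            FakeResponse([{"ID": 4}], "OK"),
        ]
        self._next = 0

    def get(self):
        # First request: return page 0 and remember where to continue.
        self._next = 1
        return self._pages[0]

    def getMoreResults(self):
        # Follow-up request: return the next page (assumed method name).
        page = self._pages[self._next]
        self._next += 1
        return page


def fetch_all(source):
    # Collect every page; the response is replaced on each pass, so the
    # loop condition sees the latest message and can become false.
    response = source.get()
    records = list(response.results)
    while response.message == "MoreDataAvailable":
        response = source.getMoreResults()
        records.extend(response.results)
    return records


all_records = fetch_all(FakeBounceEvent())
print(len(all_records))
```

The key difference from the code in the question is that both the results and the message are taken from the newest response on every pass, so the loop ends once the service stops reporting more data.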
I'm trying to fetch audio_features for several tracks. I'm using this:
client_credentials_manager = SpotifyClientCredentials(client_id='myid', client_secret='mysecret')
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
sp.trace=False
if len(sys.argv) > 1:
    artist_name = ' '.join(sys.argv[1:])
    results = sp.search(q=artist_name, limit=50)
    tids = []
    for i, t in enumerate(results['tracks']['items']):
        print(' ', i, t['name'])
        tids.append(t['uri'])
    start = time.time()
    features = sp.audio_features(tids)
    delta = time.time() - start
    for feature in features:
        analysis = sp._get(feature['analysis_url'])
        print(json.dumps(analysis, indent=4))
        print()
But when I run the code, I get:
features = sp.audio_features(tids)
AttributeError: 'Spotify' object has no attribute 'audio_features'
What am I missing? Thanks!