I am using a Python script with PRAW to loop through a long list of subreddits and pull their posts. Occasionally a subreddit on the list has been deleted or banned, which raises an HTTP exception (403, 404, etc.). Does anyone know a line or two I can add to skip the subreddits that raise errors? My code is below.
df = pd.read_csv('reddits.csv', sep=',')
df.head()
Submission = namedtuple('Submission', ['time', 'score', 'title', 'text', 'author', 'comments', 'url', 'domain', 'permalink', 'ups', 'downs', 'likes', 'crosspost', 'duplicates', 'views'])
data = []
for i in df.reddits:
    subreddit = reddit.subreddit(i)
    for submission in subreddit.new(limit=10):
        time = datetime.utcfromtimestamp(submission.created_utc)
        score = submission.score
        title = submission.title
        text = submission.selftext
        author = submission.author
        comments = submission.num_comments
        url = submission.url
        domain = submission.domain
        permalink = submission.permalink
        ups = submission.ups
        downs = submission.downs
        likes = submission.likes
        crosspost = submission.num_crossposts
        duplicates = submission.num_duplicates
        views = submission.view_count
        data.append(Submission(time, score, title, text, author, comments, url, domain, permalink, ups, downs, likes, crosspost, duplicates, views))
    df = pd.DataFrame(data)
    os.chdir('wd')
    filename = i + str(datetime.now()) + '.csv'
    df.to_csv(filename, index=False, encoding='utf-8')
You need to catch the exception; then you can continue on to the next subreddit:
import prawcore

df = pd.read_csv('reddits.csv', sep=',')
df.head()
Submission = namedtuple('Submission', ['time', 'score', 'title', 'text', 'author', 'comments', 'url', 'domain', 'permalink', 'ups', 'downs', 'likes', 'crosspost', 'duplicates', 'views'])
data = []
for i in df.reddits:
    subreddit = reddit.subreddit(i)  # lazy object; no request is made yet
    try:
        # The HTTP request only happens when the listing is fetched,
        # so the try block has to cover the iteration itself.
        for submission in subreddit.new(limit=10):
            submission = Submission(
                datetime.utcfromtimestamp(submission.created_utc),
                submission.score,
                submission.title,
                submission.selftext,
                submission.author,
                submission.num_comments,
                submission.url,
                submission.domain,
                submission.permalink,
                submission.ups,
                submission.downs,
                submission.likes,
                submission.num_crossposts,
                submission.num_duplicates,
                submission.view_count,
            )
            data.append(submission)
    except prawcore.exceptions.PrawcoreException as e:  # covers Forbidden (403), NotFound (404), Redirect, ...
        print(f"Got {e} retrieving {i}")
        continue  # control passes back to the next iteration of the outer loop
    df = pd.DataFrame(data)
    os.chdir('wd')
    filename = i + str(datetime.now()) + '.csv'
    df.to_csv(filename, index=False, encoding='utf-8')
Also, unrelated: i is not a good name for the loop variable; it traditionally stands for "index", which is not what it contains here. e would be the corresponding generic name, standing for "element", but a descriptive name such as subreddit_name would be the idiomatic choice in Python (reddit itself is already taken by the PRAW client in this code).
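For illustration, the renamed outer loop might look like this (a sketch only; the placeholder credentials and reddits.csv are carried over from the question):

import pandas as pd
import praw

reddit = praw.Reddit(client_id='...', client_secret='...', user_agent='...')  # placeholder credentials
df = pd.read_csv('reddits.csv', sep=',')

for subreddit_name in df.reddits:  # descriptive loop variable instead of i
    subreddit = reddit.subreddit(subreddit_name)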
I have multiple projects in my GitLab repository, and I commit to them as needed.
I have written Python code that generates a CSV report of all the commits I have made across all the projects in the GitLab repository; the project IDs are hard-coded in the script as a list.
The header of the CSV file is: Date, Submitted, Gitlab_url, Project, Username, Subject.
Now I want to run the pipeline manually, setting an environment variable named 'Project_Ids' and passing several project IDs as its value, so that the CSV report is generated only for those projects.
So my question is: how can I pass multiple project IDs as the value of the 'Project_Ids' key when running the pipeline manually?
import gitlab
import os
import datetime
import csv
import re

Project_id_list = ['9427','8401','17937','26813','24899','23729','34779','27638','28600']
headerList = ['Date', 'Submitted', 'Gitlab_url', 'Project', 'Branch', 'Status', 'Username', 'Ticket', 'Subject']
filename = 'mydemo_{}'.format(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))

# private token authentication
gl = gitlab.Gitlab('https://main.gitlab.in.com/', private_token="MLyWwLyEhU2zZjjjhZXog")
gl.auth()

# list all projects
for m in Project_id_list:
    i = 0
    if i < len(Project_id_list):
        i =+ 1
        print(m)
        projects = gl.projects.get(m)
        commits = projects.commits.list(all=True, query_parameters={'ref_name': 'master'})
        with open(f"{filename}_{m}.csv", 'w', newline="") as file:
            dw = csv.DictWriter(file, delimiter=',', fieldnames=headerList)
            dw.writeheader()
            for commit in commits:
                print(commit)
                msg = commit.message
                if 'master' in msg or 'LCS-' in msg:
                    projectName = projects.path_with_namespace
                    branch = 'master'
                    status = 'merged'
                    date = commit.committed_date.split('T')[0]
                    submitted1 = commit.created_at.split('T')[1]
                    submitted = submitted1.split('.000')[0]
                    Gitlab_url = commit.web_url.split('-')[0]
                    username = commit.author_name
                    subject = commit.title
                    subject1 = commit.message.splitlines()
                    print(subject1)
                    subject2 = subject1[0:3]
                    print(subject2)
                    subject3 = ' '.join(subject2)
                    print(subject3)
                    match = re.search(r'S-\d+', subject3)
                    if match:
                        ticket = match.group(0)
                        ticket_url = 'https://.in.com/browse/' + str(ticket)
                        ticket1 = ticket_url
                        dw.writerow({'Date': date, 'Submitted': submitted, 'Gitlab_url': Gitlab_url, 'Project': projectName,
                                     'Branch': branch, 'Status': status, 'Username': username, 'Ticket': ticket1,
                                     'Subject': subject3})
                    else:
                        ticket1 = 'Not Found'
                        dw.writerow({'Date': date, 'Submitted': submitted, 'Gitlab_url': Gitlab_url, 'Project': projectName,
                                     'Branch': branch, 'Status': status, 'Username': username, 'Ticket': ticket1,
                                     'Subject': subject3})
Just use a space (or some other delimiter) in the variable value, for example a string like 123 456 789.
Then, in Python, simply parse the variable, for example with the string .split method, which splits on whitespace by default.
import os
...
project_ids_variable = os.environ.get('PROJECT_IDS', '')  # '123 456 789'
project_ids = project_ids_variable.split()                # ['123', '456', '789']

for project_id in project_ids:
    project = gl.projects.get(project_id)
    print(project)
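When you trigger the pipeline manually, GitLab's "Run pipeline" page lets you add Project_Ids as a variable key with a space-separated value such as 9427 8401 17937. Below is a minimal sketch of how the parsed value could drive the existing report loop; the gl client and Project_id_list come from the script above, and the fallback to the hard-coded list is my own addition:

import os

# 'Project_Ids' is the variable name from the question; fall back to the
# hard-coded list when the variable is not set (an assumption, not required).
project_ids = os.environ.get('Project_Ids', '').split() or Project_id_list

for m in project_ids:
    project = gl.projects.get(m)
    commits = project.commits.list(all=True, query_parameters={'ref_name': 'master'})
    # ... then write the CSV rows exactly as in the original loop ...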
I am getting the tweets and the corresponding user id in an object obj, but I don't get the other information such as conversation_id, which I want to use to fetch the replies and quotes. This is the solution I found on the internet, but I couldn't get it to work.
Does anyone know how to extract the conversation_id, or any other parameter such as geo.place_id? I am using tweepy, but a solution with another library that gets the same result would also be helpful. Thanks for your help!
You can try the code if you create a separate config file and define your tokens there; I can't share mine for security reasons.
import tweepy
import config

users_name = ['derspiegel', 'zeitonline']
tweet_tab = []

def getClient():
    client = tweepy.Client(bearer_token=config.BEARER_TOKEN,
                           consumer_key=config.API_KEY,
                           consumer_secret=config.API_KEY_SECRET,
                           access_token=config.ACCESS_TOKEN,
                           access_token_secret=config.ACCESS_TOKEN_SECRET)
    return client

def searchTweets(client):
    for i in users_name:
        client = getClient()
        user = client.get_user(username=i)
        userId = user.data.id
        tweets = client.get_users_tweets(userId,
            expansions=[
                'author_id', 'referenced_tweets.id', 'referenced_tweets.id.author_id',
                'in_reply_to_user_id', 'attachments.media_keys', 'entities.mentions.username', 'geo.place_id'],
            tweet_fields=[
                'id', 'text', 'author_id', 'created_at', 'conversation_id', 'entities',
                'public_metrics', 'referenced_tweets'
            ],
            user_fields=[
                'id', 'name', 'username', 'created_at', 'description', 'public_metrics',
                'verified'
            ],
            place_fields=['full_name', 'id'],
            media_fields=['type', 'url', 'alt_text', 'public_metrics'])
        if tweets is not None and len(tweets) > 0:
            obj = {}
            obj['id'] = userId
            obj['text'] = tweets
            tweet_tab.append(obj)
    return tweet_tab

client = getClient()
searchTweets(client)
print("tableau final", tweet_tab)
My guess is that you need to put the ids into a list the function can iterate over. Create the id list and try:
def get_tweets_from_timelines():
    tweets_timelines_list = []
    for id in range(0, len(ids), 1):
        one_id = (ids[id:id+1])
        one_id = ' '.join(one_id)
        for tweet in tweepy.Paginator(client.get_users_tweets, id=one_id, max_results=100,
                tweet_fields=['attachments', 'author_id', 'context_annotations', 'created_at', 'entities',
                              'conversation_id', 'possibly_sensitive', 'public_metrics', 'referenced_tweets',
                              'reply_settings', 'source', 'withheld'],
                user_fields=['created_at', 'description', 'entities', 'profile_image_url', 'protected',
                             'public_metrics', 'url', 'verified', 'withheld'],
                expansions=['referenced_tweets.id', 'in_reply_to_user_id', 'attachments.media_keys'],
                media_fields=['preview_image_url'],
                ):
            tweets_timelines_list.append(tweet)
    return tweets_timelines_list
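To get at conversation_id itself (the point of the original question): any field requested via tweet_fields is exposed as an attribute of each Tweet object in the response's data. A minimal sketch, assuming client is the tweepy.Client built above and user_id is an id you have already looked up:

response = client.get_users_tweets(user_id, max_results=10,
                                   tweet_fields=['conversation_id', 'created_at', 'geo'])
for tweet in response.data or []:
    print(tweet.id, tweet.conversation_id)  # conversation_id for fetching replies/quotes
    if tweet.geo:                            # only present on geo-tagged tweets
        print(tweet.geo.get('place_id'))

Note that the Paginator shown above yields whole Response pages per iteration; calling .flatten() on the Paginator instead yields individual Tweet objects, so the same attribute access applies per tweet.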
I am using YouTube's API to get comment data from a list of music videos. The way I have it working right now is to manually type in my query, write the data to a CSV file, and repeat for each song, like so:
query = "song title"
query_results = service.search().list(
    part = 'snippet',
    q = query,
    order = 'relevance',  # You can consider using viewCount
    maxResults = 20,
    type = 'video',  # Channels might appear in search results
    relevanceLanguage = 'en',
    safeSearch = 'moderate',
).execute()
What I would like to do is use the title and artist columns from a CSV file I have containing the songs, so I can run the program once without having to type in each song manually.
A friend suggested using something like this:
import pandas as pd

data = pd.read_csv("metadata.csv")

def songtitle():
    for i in data.index:
        title = data.loc[i, 'title']
        title = '\"' + title + '\"'
        artist = data.loc[i, 'artist']
        return(artist, title)
But I am not sure how to make this work: when I run it, it only returns a single row of data, and even if it did run correctly, I don't know how I would get the entire program to repeat itself for each new song.
You can save the song titles and artists to lists, then loop over the title list to get the details:
def get_songTitles():
    data = pd.read_csv("metadata.csv")
    return data['artist'].tolist(), data['title'].tolist()

artist, song_titles = get_songTitles()

for song in song_titles:
    query_results = service.search().list(
        part = 'snippet',
        q = song,
        order = 'relevance',  # You can consider using viewCount
        maxResults = 20,
        type = 'video',  # Channels might appear in search results
        relevanceLanguage = 'en',
        safeSearch = 'moderate',
    ).execute()
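Note that query_results is overwritten on each pass, so if the goal is one CSV for the whole run you could collect what you need per song as you go. A rough sketch (the videoId/title fields kept and the output filename are my choices, not from the question):

import csv

rows = []
for song in song_titles:
    query_results = service.search().list(
        part='snippet', q=song, order='relevance', maxResults=20,
        type='video', relevanceLanguage='en', safeSearch='moderate',
    ).execute()
    # keep the song next to each returned video so one file covers every search
    for item in query_results.get('items', []):
        rows.append({'song': song,
                     'videoId': item['id']['videoId'],
                     'title': item['snippet']['title']})

with open('search_results.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['song', 'videoId', 'title'])
    writer.writeheader()
    writer.writerows(rows)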
I'm new to programming and I've looked at previous answers to this question but none seem relevant to this specific query.
I'm learning to analyse data with python.
This is the code:
import pandas as pd
import os
os.chdir('/Users/Benjy/Documents/Python/Data Analysis Python')
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header = None, names = unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header = None, names = rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header = None, names = mnames)
data = pd.merge(pd.merge(ratings, users), movies)
mean_ratings=data.pivot_table('ratings',rows='title', cols='gender',aggfunc='mean')
I keep getting an error saying mean_ratings is not defined...but surely it is defined in the last line of code above?
The pivot_table call in your last line raises an error (the column is named 'rating', not 'ratings', and newer pandas versions use index=/columns= instead of rows=/cols=), so the assignment to mean_ratings never happens. I think this will work: mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
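For reference, a tiny self-contained example of the same call on made-up data, just to show the shape of the result:

import pandas as pd

data = pd.DataFrame({
    'title': ['Toy Story', 'Toy Story', 'Heat', 'Heat'],
    'gender': ['F', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 4],
})

mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
print(mean_ratings)
# gender       F    M
# title
# Heat       3.0  4.0
# Toy Story  4.0  5.0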
I'm a little confused about how I would populate the following CSV function with the information in my models.py for a given user. Can anyone point me in the right direction? Do I need to process the information in a separate .py file, or can I do it in my views?
My view to download the info:
def download(request):
    response = HttpResponse(mimetype='text/csv')
    response['Content-Disposition'] = 'attachment; filename=UserData.csv'
    writer = csv.writer(response)
    writer.writerow(['Date', 'HighBGL', 'LowBGL', 'Diet', 'Weight', 'Height', 'Etc'])
    writer.writerow(['Info pertaining to date 1'])
    writer.writerow(['info pertaining to date 2'])
    return response
One of the models whose info I'm interested in saving:
class DailyVital(models.Model):
    user = models.ForeignKey(User)
    entered_at = models.DateTimeField()
    high_BGL = models.IntegerField()
    low_BGL = models.IntegerField()
    height = models.IntegerField(blank=True, null=True)
    weight = models.IntegerField(blank=True, null=True)
First you need to query your Django model, something like DailyVital.objects.all() or DailyVital.objects.filter(user=request.user).
Then you can either transform the objects into tuples manually, or use the QuerySet's values_list method with a list of field names to return tuples instead of objects. Something like:
def download(request):
    response = HttpResponse(mimetype='text/csv')
    response['Content-Disposition'] = 'attachment; filename=UserData.csv'
    writer = csv.writer(response)
    writer.writerow(['Date', 'HighBGL', 'LowBGL', 'Weight', 'Height'])
    query = DailyVital.objects.filter(user=request.user)
    for row in query.values_list('entered_at', 'high_BGL', 'low_BGL', 'weight', 'height'):
        writer.writerow(row)
    return response
If you didn't need it in Django, you might also consider the sqlite3 command line program's -csv option.
An easy way to do this would be to convert your models into a list of lists.
First you need an object-to-list function:
def object2list(obj, attr_list):
    """Return values (or None) for the object's attributes in attr_list."""
    return [getattr(obj, attr, None) for attr in attr_list]
Then you just pass that to the csv writer with a list comprehension (given some list_of_objects that you've queried):
attr_list = ['date', 'high_BGL', 'low_BGL', 'diet', 'weight', 'height']
writer.writerows([object2list(obj, attr_list) for obj in list_of_objects])
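Put together, those pieces would sit inside the same view skeleton from the question; a minimal sketch, assuming the DailyVital model shown earlier and using its actual field names in attr_list:

import csv
from django.http import HttpResponse

def download(request):
    response = HttpResponse(mimetype='text/csv')  # newer Django versions use content_type= instead
    response['Content-Disposition'] = 'attachment; filename=UserData.csv'
    writer = csv.writer(response)
    writer.writerow(['Date', 'HighBGL', 'LowBGL', 'Weight', 'Height'])

    # object2list and DailyVital are defined above; attributes missing from the
    # model simply come back as None
    attr_list = ['entered_at', 'high_BGL', 'low_BGL', 'weight', 'height']
    objects = DailyVital.objects.filter(user=request.user)
    writer.writerows([object2list(obj, attr_list) for obj in objects])
    return response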