Fair warning: this isn't just a PRAW question; it leans more toward Python than PRAW, so Python people are welcome to contribute. Please note that English is not my mother tongue xD!
Essentially, I'm writing a Reddit bot using PRAW that does the following:
Loop through "unsaved" posts
Loop through the comments of said posts (targeting subcomments)
If the comment contains "!completed", is written by the submitter OR is a moderator, and the parent comment is not by submitter:
Do etc., e.x. print("Hey")
I know I didn't explain that too well. Examples are better, so here xD:
Use cases:
- Post by #dudeOne
  - Comment by #dudeTwo
    - Comment with "!completed" by #dudeOne
- Post by #dudeOne
  - Comment by #dudeTwo
    - Comment with "!completed" by #moderatorOne
... both print("Hey"), and:
- Post by #dudeOne
  - Comment by #dudeOne
    - Comment with "!completed" by #dudeOne
... does nothing, maybe even removes the comment and messages #dudeOne.
Here's my messy code (xD):
import praw
import os
import re

sub = "RedditsQuests"
client_id = os.environ.get('client_id')
client_secret = os.environ.get('client_secret')
password = os.environ.get('pass')

reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     password=password,
                     user_agent='r/RedditsQuests bot',
                     username='TheQuestMaster')

for submission in reddit.subreddit(sub).new(limit=None):
    submission.comments.replace_more(limit=None)
    if submission.saved is False:
        for comment in submission.comments.list():
            if ((("!completed" in comment.body)) and ((comment.is_submitter) or ('RedditsQuests' in comment.author.moderated())) and (comment.parent().author.name is not submission.author.name)):
                print("etc...")
There's a decently-sized stack trace, so I've added it to this bin for your reference. To me it looks like PRAW is timing out because the if-in-for loop is taking too long, but I could be wrong!
The issue (as you've said) is somewhat sporadic, but I've narrowed it down. As it turns out, trying to fetch the subreddits moderated by /u/AutoModerator will sometimes time out (presumably because the list is so long).
Figuring out the issue
Here's how I found the issue. Skip this section if you're only interested in the solution.
First, I modified your script with try and except to catch the exception when it happened. The traceback showed that it was happening on the line that starts with if ((("!completed" in comment.body)), specifically when fetching the subreddits that a user moderates. Here was my modified script:
for submission in reddit.subreddit(sub).new(limit=None):
    submission.comments.replace_more(limit=None)
    if submission.saved is False:
        for comment in submission.comments.list():
            try:
                if (
                    (("!completed" in comment.body))
                    and (
                        (comment.is_submitter)
                        or ("RedditsQuests" in comment.author.moderated())
                    )
                    and (comment.parent().author.name is not submission.author.name)
                ):
                    print("etc...")
            except Exception:
                print(f'Author: {comment.author} ({type(comment.author)})')
And the output:
etc...
etc...
Author: AutoModerator (<class 'praw.models.reddit.redditor.Redditor'>)
etc...
etc...
etc...
Author: AutoModerator (<class 'praw.models.reddit.redditor.Redditor'>)
etc...
etc...
etc...
etc...
etc...
etc...
etc...
Author: AutoModerator (<class 'praw.models.reddit.redditor.Redditor'>)
etc...
Author: AutoModerator (<class 'praw.models.reddit.redditor.Redditor'>)
etc...
etc...
With this in mind, I wrote a very simple three-line script to reproduce the issue:
import praw
reddit = praw.Reddit(...)
print(reddit.redditor("AutoModerator").moderated())
Sometimes this script would succeed but sometimes it would fail with the same socket read timeout. Presumably the timeout happens because AutoModerator moderates so many subreddits (at least 10,000), and the Reddit API takes too long to process the request.
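If you want to reproduce this without the script dying, you can catch the error PRAW raises for transport-level failures (prawcore's RequestException, which, if I'm not mistaken, is what wraps the underlying socket timeout):

import praw
from prawcore.exceptions import RequestException

reddit = praw.Reddit(...)  # same credentials as before

try:
    # fetching every subreddit the account moderates is what times out
    print(len(list(reddit.redditor("AutoModerator").moderated())))
except RequestException as e:
    print(f"Request timed out: {e}")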
Fixing the issue
Your script tries to determine whether the redditor in question is a moderator of the subreddit. You're doing this by checking whether the subreddit is in the list of the user's moderated subreddits, but you can flip the check around: test whether the user is in the list of the subreddit's moderators. Not only should this not time out, but you'll also save a lot of network requests, because you can fetch the list of moderators just once.
The PRAW documentation for Subreddit shows how to get the list of a subreddit's moderators. In your case, we can do
moderators = list(reddit.subreddit(sub).moderator())
Then, instead of checking "RedditsQuests" in comment.author.moderated(), we check
comment.author in moderators
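(As far as I can tell, this membership test works because PRAW model objects compare by name: a Redditor equals another Redditor, or even a plain string, when the names match case-insensitively, so checking in against the moderator list behaves as you'd expect.)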
Your code then becomes:
import praw
import os
import re

sub = "RedditsQuests"
client_id = os.environ.get("client_id")
client_secret = os.environ.get("client_secret")
password = os.environ.get("pass")

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    password=password,
    user_agent="r/RedditsQuests bot",
    username="TheQuestMaster",
)

# fetch the moderator list once, up front
moderators = list(reddit.subreddit(sub).moderator())

for submission in reddit.subreddit(sub).new(limit=None):
    submission.comments.replace_more(limit=None)
    if submission.saved is False:
        for comment in submission.comments.list():
            if (
                ("!completed" in comment.body)
                and (comment.is_submitter or comment.author in moderators)
                # != compares values; "is not" compares identity, which is a bug here
                and comment.parent().author.name != submission.author.name
            ):
                print("etc...")
In my brief testing, this script runs many times faster, since we fetch the list of moderators only once rather than fetching all the subreddits moderated by every user who commented.
As an unrelated style note: instead of if submission.saved is False, you should write if not submission.saved, which is the conventional way to check that a condition is false.
Related
I'm making a Discord meme bot and it's working, but I have a small problem with it: it keeps sending the same set of old memes for some reason. How do I fix that?
def updateMemes(subreddit = "memes"):
    global name
    global url
    subreddit = reddit.subreddit("memes")
    allSubs = []
    top = subreddit.top(limit=100)
    for submission in top:
        allSubs.append(submission)
    randomSub = random.choice(allSubs)
    name = randomSub.title
    url = randomSub.url
Because you're explicitly requesting the top 100 posts of all time, which rarely change:
top = subreddit.top(limit=100)
As you can see in the PRAW docs, the default value for time_filter is "all". If you want to get the top posts of today/this week/... instead, which vary a lot more, pass an argument for the time_filter parameter.
The accepted values, including examples, are in the docs page I linked.
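For example, a minimal sketch ("week" and "day" are two of those accepted values):

# top 100 posts of the past week instead of all time
top = subreddit.top(time_filter="week", limit=100)

# or, for a pool of memes that changes every day:
top = subreddit.top(time_filter="day", limit=100)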
There's an issue with my code: no matter what I try, every time I reply to a tweet, it just posts as a regular status update on my timeline.
Here is a snippet of the code:
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        tweetid = status.id
        tweetnouser = status.text.replace("#CarlWheezerBot", "")
        username = '@' + status.user.screen_name
        user_tweet = gTTS(text=tweetnouser, lang='en', slow=False)
        # saving the converted audio
        user_tweet.save("useraudio/text2speech.mp3")
        # importing the audio and getting the audio all mashed up
        text2speech = AudioFileClip("useraudio/text2speech.mp3")
        videoclip = VideoFileClip("original_video/original_cut.mp4")
        editedAudio = videoclip.audio
        # splicing the original audio with the text2speech
        compiledAudio = CompositeAudioClip([editedAudio.set_duration(3.8), text2speech.set_start(3.8)])
        videoclip.audio = compiledAudio
        # saving the completed video file
        videoclip.write_videofile("user_video/edited.mp4", audio_codec='aac')
        upload_result = api.media_upload("user_video/edited.mp4")
        api.update_status(status='#CarlWheezerBot', in_reply_to_status_id=[tweetid], media_ids=[upload_result.media_id_string], auto_populate_reply_metadata=True)
I have also tried it without any status text, as well as using status.id_str. Nothing seems to work. I have tried it without the metadata parameter as well. I am following the documentation word for word.
OKAY, for everyone reading this in the future: use
in_reply_to_status_id=tweetid
Do not use the square brackets. Everything works perfectly now.
While playing around with it, I also noticed that you should mention the author of the tweet you're replying to, especially if you're replying to an existing reply, because otherwise it will still post as a status update. Line from the documentation:
in_reply_to_status_id – The ID of an existing status that the update is in reply to. Note: This parameter will be ignored unless the author of the Tweet this parameter references is mentioned within the status text. Therefore, you must include @username, where username is the author of the referenced Tweet, within the update.
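Putting both points together, the corrected call would look something like this (a sketch reusing the variable names from the snippet above):

# pass the ID directly (no square brackets) and mention the tweet's author
api.update_status(status=username + ' #CarlWheezerBot',
                  in_reply_to_status_id=tweetid,
                  media_ids=[upload_result.media_id_string],
                  auto_populate_reply_metadata=True)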
I'm trying to copy all my LiveJournal posts to my new blog on blogger.com. I do so by using a slightly modified example that ships with the gdata Python client, and a JSON file with all of my posts exported from LiveJournal. The issue is that blogger.com has a daily limit of 50 new blog entries per day, so you can imagine that my 1300+ posts will take a month to copy, since I can't programmatically enter the captcha after 50 imports.
I recently learned that there's also a batch operation mode somewhere in gdata, but I couldn't figure out how to use it. Googling didn't really help.
Any advice or help will be highly appreciated.
Thanks.
Update
Just in case, here is the code I use:
#!/usr/local/bin/python
import json
import requests
from gdata import service
import gdata
import atom
import getopt
import sys
from datetime import datetime as dt
from datetime import timedelta as td
from datetime import tzinfo as tz
import time

allEntries = json.load(open("todays_copy.json", "r"))

class TZ(tz):
    def utcoffset(self, dt):
        return td(hours=-6)

class BloggerExample:
    def __init__(self, email, password):
        # Authenticate using ClientLogin.
        self.service = service.GDataService(email, password)
        self.service.source = "Blogger_Python_Sample-1.0"
        self.service.service = "blogger"
        self.service.server = "www.blogger.com"
        self.service.ProgrammaticLogin()

        # Get the blog ID for the first blog.
        feed = self.service.Get("/feeds/default/blogs")
        self_link = feed.entry[0].GetSelfLink()
        if self_link:
            self.blog_id = self_link.href.split("/")[-1]

    def CreatePost(self, title, content, author_name, label, time):
        LABEL_SCHEME = "http://www.blogger.com/atom/ns#"
        # Create the entry to insert.
        entry = gdata.GDataEntry()
        entry.author.append(atom.Author(atom.Name(text=author_name)))
        entry.title = atom.Title(title_type="xhtml", text=title)
        entry.content = atom.Content(content_type="html", text=content)
        entry.published = atom.Published(time)
        entry.category.append(atom.Category(scheme=LABEL_SCHEME, term=label))
        # Ask the service to insert the new entry.
        return self.service.Post(entry,
            "/feeds/" + self.blog_id + "/posts/default")

    def run(self, data):
        for year in allEntries:
            for month in year["yearlydata"]:
                for day in month["monthlydata"]:
                    for entry in day["daylydata"]:
                        # print year["year"], month["month"], day["day"], entry["title"].encode("utf-8")
                        atime = dt.strptime(entry["time"], "%I:%M %p")
                        hr = atime.hour
                        mn = atime.minute
                        ptime = dt(year["year"], int(month["month"]), int(day["day"]),
                                   hr, mn, 0, tzinfo=TZ()).isoformat("T")
                        public_post = self.CreatePost(entry["title"],
                                                      entry["content"],
                                                      "My name",
                                                      ",".join(entry["tags"]),
                                                      ptime)
                        print "%s, %s - published, Waiting 30 minutes" % (ptime, entry["title"].encode("utf-8"))
                        time.sleep(30 * 60)

def main(data):
    email = "my@email.com"
    password = "MyPassW0rd"
    sample = BloggerExample(email, password)
    sample.run(data)

if __name__ == "__main__":
    main(allEntries)
I would recommend using Google Blog Converters instead (https://code.google.com/archive/p/google-blog-converters-appengine/).
To get started you will have to go through:
https://github.com/google/gdata-python-client/blob/master/INSTALL.txt - steps for setting up the Google GData API
https://github.com/pra85/google-blog-converters-appengine/blob/master/README.txt - steps for using the blog converters
Once you have everything set up, run the following command (with your LiveJournal username and password):
livejournal2blogger.sh -u <username> -p <password> [-s <server>]
Redirect its output into a .xml file. This file can now be imported into a Blogger blog directly: go to the Blogger Dashboard, then your blog > Settings > Other > Blog tools > Import Blog.
Remember to check the Automatically publish all imported posts and pages option. I have tried this once before with a blog of over 400 posts, and Blogger successfully imported and published them without issue.
In case you're worried that Blogger might have issues (because the number of posts is quite high), or you have other Blogger blogs in your account, then as a precaution create a separate Blogger (Google) account and try importing the posts there. Afterwards you can transfer admin control to your real Blogger account (to transfer, first send an author invite, then raise your real Blogger account to admin level, and lastly remove the dummy account; the option for sending the invite is at Settings > Basic > Permissions > Blog Authors).
Also make sure that you are using Python 2.5, otherwise these scripts will not run. Before running livejournal2blogger.sh, change the following line (thanks to Michael Fleet for this fix: http://michael.f1337.us/2011/12/28/google-blog-converters-blogger2wordpress/):
PYTHONPATH=${PROJ_DIR}/lib python ${PROJ_DIR}/src/livejournal2blogger/lj2b.py $*
to
PYTHONPATH=${PROJ_DIR}/lib python2.5 ${PROJ_DIR}/src/livejournal2blogger/lj2b.py $*
P.S. I am aware this is not the answer to your question, but the objective of this answer is the same as that of your question (to import more than 50 posts in a day), which is why I shared it. I don't have much knowledge of Python or the GData API; I just set up the environment and followed these steps to answer this question (and I was able to import posts from LiveJournal to Blogger with it).
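As for the batch operation mode mentioned in the question: the gdata client does ship batch helpers. The snippet below shows the generic pattern from the gdata documentation; note that it uses the Google Base feed classes (gdata.base), so the feed and entry types would need adapting for Blogger: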
# build feed
request_feed = gdata.base.GBaseItemFeed(atom_id=atom.Id(text='test batch'))
# format each object
entry1 = gdata.base.GBaseItemFromString('--XML for your new item goes here--')
entry1.title.text = 'first batch request item'
entry2 = gdata.base.GBaseItemFromString('--XML for your new item here--')
entry2.title.text = 'second batch request item'
# Add each blog item to the request feed
request_feed.AddInsert(entry1)
request_feed.AddInsert(entry2)
# Execute the batch processes through the request_feed (all items)
result_feed = gd_client.ExecuteBatch(request_feed)
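If I remember the gdata client correctly, each entry in result_feed carries a batch_status whose code tells you whether that particular insert succeeded, so you can check the results entry by entry. Be aware, though, that batching may not get around Blogger's 50-posts-per-day quota, since that appears to be a service-level limit rather than a per-request one.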
I am trying to add memcache to my webapp deployed on GAE, and to do this I am using memcache.Client() to prevent damage from any race conditions:
from google.appengine.api import memcache

client = memcache.Client()

class BlogFront(BlogHandler):
    def get(self):
        global client
        val = client.gets(FRONT_PAGE_KEY)
        posts = list()
        if val is not None:
            posts = list(val)
        else:
            posts = db.GqlQuery("select * from Post order by created desc limit 10")
            client.cas(FRONT_PAGE_KEY, list(posts))
        self.render('front.html', posts=posts)
To test the problem, I have a front page for a blog that displays the 10 most recent entries. If there is nothing in the cache, I hit the DB with a request; otherwise I just present the cached results to the user. The problem is that no matter what I do, I always get val == None, which means I always hit the database with a useless request.
I have sifted through the documentation:
https://developers.google.com/appengine/docs/python/memcache/
https://developers.google.com/appengine/docs/python/memcache/clientclass
http://neopythonic.blogspot.pt/2011/08/compare-and-set-in-memcache.html
And it appears that I am doing everything correctly. What am I missing?
(PS: I am a Python newb, so if this is a silly error, please bear with me xD)
from google.appengine.api import memcache

class BlogFront(BlogHandler):
    def get(self):
        client = memcache.Client()
        client.gets(FRONT_PAGE_KEY)
        client.cas(FRONT_PAGE_KEY, 'my content')
For a reason I cannot yet understand, the solution lies in having a gets call right before the cas call...
I think I will stick with the non-thread-safe version of the memcache code for now...
I suspect that the client.cas call is failing because there is no existing object. Perhaps client.cas only works to update an existing object (not to set a new one if none is currently there)? You might try client.add() (which will fail if an object already exists with the specified key, which I think is what you want) instead of client.cas().
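Putting those pieces together, here is a sketch of the handler that seeds the cache with add() on a miss and leaves cas() for later compare-and-set updates (same names as the code above; note that add() and cas() signal failure by returning False rather than raising):

from google.appengine.api import memcache

client = memcache.Client()

class BlogFront(BlogHandler):
    def get(self):
        val = client.gets(FRONT_PAGE_KEY)
        if val is not None:
            posts = list(val)
        else:
            posts = list(db.GqlQuery(
                "select * from Post order by created desc limit 10"))
            # add() stores the value only if the key does not exist yet;
            # cas() can then be used for compare-and-set updates later,
            # after a gets() has seen the stored value
            if not client.add(FRONT_PAGE_KEY, posts):
                # someone else seeded the cache first; re-read it
                val = client.gets(FRONT_PAGE_KEY)
                posts = list(val) if val is not None else posts
        self.render('front.html', posts=posts)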
I'm brand new at Python, and I'm trying to write an extension to an app that imports GA information and parses it into MySQL. There is a shamefully sparse amount of information on the topic. The Google docs only seem to have examples in JS and Java...
...I have gotten to the point where my user can authenticate into GA using AuthSub. That code is here:
import gdata.service
import gdata.analytics
from django import http
from django import shortcuts
from django.shortcuts import render_to_response

def authorize(request):
    next = 'http://localhost:8000/authconfirm'
    scope = 'https://www.google.com/analytics/feeds'
    secure = False  # set secure=True to request secure AuthSub tokens
    session = False
    auth_sub_url = gdata.service.GenerateAuthSubRequestUrl(next, scope, secure=secure, session=session)
    return http.HttpResponseRedirect(auth_sub_url)
So, the next step is getting at the data. I have found this library (beware, the UI is offensive): http://gdata-python-client.googlecode.com/svn/trunk/pydocs/gdata.analytics.html
However, I have found it difficult to navigate. It seems like I should be using gdata.analytics.AnalyticsDataEntry.getDataEntry(), but I'm not sure what it is asking me to pass it.
I would love a push in the right direction. I feel I've exhausted Google looking for a working example.
Thank you!!
EDIT: I have gotten farther, but my problem still isn't solved. The method below returns data (I believe)... the error I get is: "'str' object has no attribute '_BecomeChildElement'". I believe I am getting back a feed? However, I don't know how to drill into it. Is there a way for me to inspect this object?
def auth_confirm(request):
    gdata_service = gdata.service.GDataService('iSample_acctSample_v1.0')
    feedUri = 'https://www.google.com/analytics/feeds/accounts/default?max-results=50'
    # request feed
    feed = gdata.analytics.AnalyticsDataFeed(feedUri)
    print str(feed)
Maybe this post can help out. It seems there are no Analytics-specific bindings yet, so you are working with the generic gdata.
I've been using GA for a little over a year now, and since about April 2009 I have used the Python bindings supplied in a package called python-googleanalytics by Clint Ecker et al. So far, it works quite well.
Here's where to get it: http://github.com/clintecker/python-googleanalytics.
Install it the usual way.
To use it: first, so that you don't have to manually pass in your login credentials each time you access the API, put them in a config file like so:
[Credentials]
google_account_email = youraccount@gmail.com
google_account_password = yourpassword
Name this file '.pythongoogleanalytics' and put it in your home directory.
Then, from an interactive prompt, type:
from googleanalytics import Connection
import datetime

connection = Connection()  # pass in id & pw as strings if not in the config file
account = connection.get_account(<your GA profile ID goes here>)
start_date = datetime.date(2009, 12, 01)
end_date = datetime.date(2009, 12, 13)
# the account object does the work; specify what data you want with
# 'metrics' & 'dimensions'; see the 'USAGE.md' file for examples
account.get_data(start_date=start_date, end_date=end_date, metrics=['visits'])
The get_data call will return a Python list containing your data (get_account, above, binds the account object to the variable 'account').
You need three files within the app: client_secrets.json, analytics.dat, and google_auth.py.
Create a module Query.py within the app:
class Query(object):
    def __init__(self, startdate, enddate, filter, metrics):
        self.startdate = startdate.strftime('%Y-%m-%d')
        self.enddate = enddate.strftime('%Y-%m-%d')
        self.filter = "ga:medium=" + filter
        self.metrics = metrics
An example models.py with the following function:
import google_auth

service = google_auth.initialize_service()

def total_visit(self):
    object = AnalyticsData.objects.get(utm_source=self.utm_source)
    trial = Query(object.date.startdate, object.date.enddate, object.utm_source, "ga:sessions")
    result = service.data().ga().get(ids='ga:<your-profile-id>', start_date=trial.startdate,
                                     end_date=trial.enddate, filters=trial.filter,
                                     metrics=trial.metrics).execute()
    total_visit = result.get('rows')
    # <yr save command, ColumnName.object.create(data=total_visit) goes here>