PRAW 6: Get all submissions of a subreddit - Python

I'm trying to iterate over submissions of a certain subreddit from the newest to the oldest using PRAW. I used to do it like this:
subreddit = reddit.subreddit('LandscapePhotography')
for submission in subreddit.submissions(None, time.time()):
print("Submission Title: {}".format(submission.title))
However, when I try to do it now I get the following error:
AttributeError: 'Subreddit' object has no attribute 'submissions'
From looking at the docs I can't seem to figure out how to do this. The best I can do is:
for submission in subreddit.new(limit=None):
print("Submission Title: {}".format(submission.title))
However, this is limited to the first 1000 submissions only.
Is there a way to do this with all submissions and not just the first 1000?

Unfortunately, Reddit removed this function from their API.
Check out the PRAW changelog. One of the changes in version 6.0.0 is:
Removed
Subreddit.submissions as the API endpoint backing the method is no more. See
https://www.reddit.com/r/changelog/comments/7tus5f/update_to_search_api/.
The linked post says that Reddit is disabling Cloudsearch for all users:
Starting March 15, 2018 we’ll begin to gradually move API users over to the new search system. By end of March we expect to have moved everyone off and finally turn down the old system.
PRAW's Subreddit.submissions() used Cloudsearch to search for posts between the given timestamps. Since Cloudsearch has been removed and the search that replaced it doesn't support timestamp search, it is no longer possible to perform a timestamp-based search with PRAW or any other Reddit API client. This includes trying to get all posts from a subreddit.
For more information, see this thread from /r/redditdev posted by the maintainer of PRAW.
Alternatives
Since Reddit limits all listings to ~1000 entries, it is currently impossible to get all posts in a subreddit using their API. However, third-party datasets with APIs exist, such as pushshift.io. As /u/kungming2 said on Reddit:
You can use Pushshift.io to still return data from defined time
periods by using their API:
https://api.pushshift.io/reddit/submission/search/?after=1334426439&before=1339696839&sort_type=score&sort=desc&subreddit=translator
This, for example, allows you to parse submissions to r/translator
between 2012-04-14 and 2012-06-14.

You can retrieve all the data from pushshift.io using an iterative loop. Just set the start date to the current epoch time, fetch 1000 items, then pass the created_utc of the last item in the list as the before parameter to get the next 1000 items, and keep going until the API stops returning results.
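A minimal sketch of that loop using the requests library; the endpoint and parameters mirror the pushshift URL quoted above, but the size limit and field names such as created_utc should be checked against the current pushshift response format.

import time
import requests

def fetch_all_submissions(subreddit):
    """Page backwards through pushshift results until nothing is returned."""
    url = "https://api.pushshift.io/reddit/submission/search/"
    before = int(time.time())  # start from the current epoch time
    while True:
        params = {"subreddit": subreddit, "size": 1000, "before": before, "sort": "desc"}
        data = requests.get(url, params=params).json().get("data", [])
        if not data:
            break  # no more results
        for submission in data:
            yield submission
        # move the cursor to the oldest item we just received
        before = data[-1]["created_utc"]

for submission in fetch_all_submissions("LandscapePhotography"):
    print(submission.get("title"))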
Below is a useful link for further information:
https://www.reddit.com/r/pushshift/comments/b7onr6/max_number_of_results_returned_per_query/

Pushshift doesn't work for private subreddits. In that case you can build your own database, 1000 submissions at a time, from now on (it's not retroactive).
If you just need as many submissions as possible, you could try the different sort methods (top, hot, new) and combine their results, as sketched below.
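A rough sketch of that approach with PRAW, assuming a reddit instance has already been authenticated; the results are de-duplicated by submission ID because the listings overlap.

subreddit = reddit.subreddit('LandscapePhotography')

seen_ids = set()
submissions = []
# Each listing is capped at ~1000 items, but they overlap only partially,
# so combining them usually yields more unique submissions than any single one.
for listing in (subreddit.new(limit=None),
                subreddit.hot(limit=None),
                subreddit.top(limit=None)):
    for submission in listing:
        if submission.id not in seen_ids:
            seen_ids.add(submission.id)
            submissions.append(submission)

print("Collected {} unique submissions".format(len(submissions)))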

Related

Can I get the view counts of a video by date using YouTube Data API v3?

I want to find out how "trending" a video is. Is there a way to get view counts by date?
Like this:
2020/5/11 1,234,567
2020/5/10 1,200,000
...
Or maybe the number of new views gained per date? Or the view count on a certain date? I'm fine with anything.
Updates
Last night was pretty late, and I did not realize I could get so many downvotes!
I am able to "connect" the YouTube Data API and OAuth 2.0 with my credentials. I am referring to the documentation of the former, which can be found here: https://developers.google.com/youtube/v3/docs
From my reading, I found (and tried) the rate and getRating methods under the "Videos" section. Obviously they did not work, since rate is for when "I" rate a video and getRating only returns a binary result (whether "I" "liked" it or not).
What I did with getRating
request = youtube.videos().getRating(
id="Ks-_Mh1QhMc,c0KYU2j0TM4,eIho2S0ZahI"
)
response = request.execute()
response
Then I tried changing the part argument of list under "Videos". This works, but it only gets the current views, likes, and dislikes.
What I did with changing part argument
request = youtube.videos().list(
part="id, statistics",
id="njn6krU3tQ8"
)
response = request.execute()
response
Now the problem is how can I get views, likes, and dislikes by date? From what I read in the documentation, there's nothing related to "date" under list. I also did some research about this, and of course no answers solved my question, at least from what I found.
Miscellaneous notes for my current comment(s) and answer(s)
I do not have any data yet. The point of this is to collect some data to use later in R, which I am more familiar with.
This is not my "work". I am doing research on "how COVID-19 affects YouTube views" sort of thing. I just want to see if I can find anything interesting.
I am sorry that I did not add enough information about what I was doing. I was too tired and stayed up late yesterday.
If you can get the data into a pandas DataFrame, you can use its groupby() function to achieve the desired result.
df.set_index('your_column_name').groupby(pd.Grouper(freq='D')).sum().dropna()
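A small, self-contained illustration of that pattern, assuming you have already collected per-timestamp view counts into a DataFrame; the column names here are made up.

import pandas as pd

# Hypothetical data: new views recorded at arbitrary times
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-05-10 08:00", "2020-05-10 20:00", "2020-05-11 09:30"]),
    "new_views": [600000, 600000, 34567],
})

# Group the samples into calendar days and sum the views gained per day
daily = df.set_index("timestamp").groupby(pd.Grouper(freq="D")).sum().dropna()
print(daily)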
Querying the YouTube Analytics API for video reports can return the metrics of a video for a specific period.
Just set the dimensions parameter to video,day if you want the results for each video per day, and specify the video IDs in the filters parameter.
The scope needed for this is yt-analytics.readonly.
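A rough sketch of such a query using the google-api-python-client, assuming an OAuth-authorized youtubeAnalytics service object has already been built with the yt-analytics.readonly scope; the video ID and dates are placeholders.

request = youtubeAnalytics.reports().query(
    ids="channel==MINE",
    startDate="2020-04-01",
    endDate="2020-05-11",
    metrics="views,likes,dislikes",
    dimensions="video,day",          # one row per video per day
    filters="video==njn6krU3tQ8",    # restrict to the video(s) of interest
    sort="day",
)
response = request.execute()
# Each row follows the order given in columnHeaders, e.g. [videoId, day, views, likes, dislikes]
for row in response.get("rows", []):
    print(row)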

Twitter scraping in Python

I have to scrape tweets from Twitter for a specific user (#salvinimi), from January 2018. The issue is that there are a lot of tweets in this range of time, and so I am not able to scrape all the ones I need!
I tried multiple solutions:
1)
pip install twitterscraper
from twitterscraper import query_tweets_from_user as qtfu
tweets = qtfu(user='matteosalvinimi')
With this method, I get only a few tweets (500~600 more or less), instead of all the tweets... Do you know why?
2)
!pip install twitter_scraper
from twitter_scraper import get_tweets
tweets = []
for i in get_tweets('matteosalvinimi', pages=100):
    tweets.append(i)
With this method I get an error -> "ParserError: Document is empty"...
If I set "pages=40", I get the tweets without errors, but not all the ones. Do you know why?
Three things for the first issue you encounter:
First of all, every API has its limits, and one like Twitter's would be expected to monitor usage and eventually stop a user from retrieving data if they ask for more than the limits allow. Trying to overcome the limitations of the API might not be the best idea and might result in being banned from the site or worse (I'm taking guesses here as I don't know Twitter's policy on the matter). That said, the documentation of the library you're using states:
With Twitter's Search API you can only sent 180 Requests every 15 minutes. With a maximum number of 100 tweets per Request this means you can mine for 4 x 180 x 100 = 72.000 tweets per hour.
By using TwitterScraper you are not limited by this number but by your internet speed/bandwith and the number of instances of TwitterScraper you are willing to start.
Then, the function you're using, query_tweets_from_user(), has a limit argument which you can set to an integer. One thing you can try is changing that argument and seeing whether you get what you want.
Finally, if the above does not work, you could split your time range into two, three or more subsets, collect the data separately and merge it together afterwards (see the sketch after this list).
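A rough sketch of both suggestions with twitterscraper; the limit value and date windows are arbitrary examples, and query_tweets is used for the date-bounded queries since it accepts begindate/enddate arguments.

import datetime as dt
from twitterscraper import query_tweets, query_tweets_from_user

# Suggestion 1: raise the limit argument
tweets = query_tweets_from_user('matteosalvinimi', limit=5000)

# Suggestion 2: split the time range into smaller windows and merge the results
windows = [
    (dt.date(2018, 1, 1), dt.date(2018, 1, 11)),
    (dt.date(2018, 1, 11), dt.date(2018, 1, 21)),
    (dt.date(2018, 1, 21), dt.date(2018, 2, 1)),
]
all_tweets = []
for begin, end in windows:
    all_tweets.extend(
        query_tweets('from:matteosalvinimi', begindate=begin, enddate=end)
    )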
The second issue you mention might be due to many different things, so I'll just take a broad guess here. Either pages=100 is too high and in one way or another the program or the API is unable to retrieve the data, or you're trying to look at a hundred pages when in reality there are fewer than a hundred to look at, which results in the program trying to parse an empty document.

How to get list of categories in Google Books API?

I was searching for an already answered question about this but couldn't find one so please forgive me if I somehow missed it.
I'm using Google Books API and I know I can search a book by specific category.
My question is, how can I get all the available categories from the API?
I looked in the API documentation but couldn't find any mention of this.
The Google Books API does not have an endpoint for returning categories that are not associated with a specific book.
The Google Books API is only there to list books. You can:
search and browse through the list of books that match a given query.
view information about a book, including metadata, availability and price, links to the preview page.
manage your own bookshelves.
You can see the categories of a book, but you cannot get a list of all available categories in the whole system.
You may be interested to know this has been on their to-do list since 2012: category list
We have numerous requests for this and we're investigating how we can properly provide the data. One issue is Google does not own all the category information. "New York Times Bestsellers" is one obvious example. We need to first identify what we can publish through the API.
Workaround
I worked around it by implementing my own category list mechanism, so I can pull all the categories that exist in my app's database.
(Unfortunately, the newly announced ScriptDb deprecation means my whole system will go to waste in a couple of months anyway... but that's another story.)
https://support.google.com/books/partner/answer/3237055?hl=en
Scroll down to subject/genres and you will see this link.
https://bisg.org/page/bisacedition
This list is apparently a list of subjects, AKA categories, for North American books. I am making various GET requests with an API testing tool and getting, for the most part, perfect matches between whatever subject I choose from the BISG subjects list and what comes back in the JSON response under the "categories" key (you may have to drop a word from the query string, e.g. "criticism" instead of "literary criticism").
Ex: GET https://www.googleapis.com/books/v1/volumes?q=business+subject:juvenile+fiction
Long story short, the BISG link is where I'm pretty sure Google got all the options for their "categories" key from.
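For illustration, a small sketch of that kind of request in Python using requests; the query mirrors the GET example above, and the fields read from the response (volumeInfo, categories) follow the public volumes schema.

import requests

resp = requests.get(
    "https://www.googleapis.com/books/v1/volumes",
    params={"q": "business subject:juvenile fiction"},
)
resp.raise_for_status()

# Collect every category string reported for the returned volumes
categories = set()
for item in resp.json().get("items", []):
    for category in item.get("volumeInfo", {}).get("categories", []):
        categories.add(category)

print(sorted(categories))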

SoundCloud API ignoring duration filter

Following the SoundCloud API Documentation at https://developers.soundcloud.com/docs/api/reference#tracks, I started to write an implementation of the SoundCloud API in one of my projects. I tried to get 50 tracks of a specific genre with a minimum length of 120000ms using this code:
def get_starttracks(genres="Rock"):
    return client.get("/tracks", genres=genres, duration={
        'from': 120000
    }, limit='50')
SoundCloud responds with a valid list of tracks, but their durations don't match the given filter.
Example:
print(get_starttracks(genres="Pop")[0].fields()['duration'])
> 30000
Is the API ignoring the 'duration' parameter, or is there an error in the filter in my code?
P.S.: Could be related to soundcloud search api ignoring duration filter?, if the error isn't in the Python code.
After trying to fix this problem with several changes to my code, I finally found the issue:
It's NOT a bug. When SoundCloud released their "Go+" service, some official tracks were limited to a 30-second preview. The API filter seems to compare the duration of the full track, while sending only the preview version back to the client (if you haven't subscribed to "Go+" and/or your application is not logged in).
So, the only way to filter by duration is to iterate through all received tracks:
for track in list(tracks):  # iterate over a copy so items can be removed safely
    if track.duration <= 30000:
        tracks.remove(track)

Python Twitter Statistics

I need to get the number of people who have followed a certain account by month, also the number of people who have unfollowed the same account by month, the total number of tweets by month, and the total number of times something the account tweeted has been retweeted by month.
I am using Python to do this, and have installed python-twitter, but as the documentation is rather sparse, I'm having to do a lot of guesswork. I was wondering if anyone could point me in the right direction? I was able to get authenticated using OAuth, so that's not an issue, I just need some help with getting those numbers.
Thank you all.
These types of statistical breakdowns are not generally available via the Twitter API. Depending on your sample date range, you may have luck using Twittercounter.com's API (you can sign up for an API key here).
The API is rate limited to 100 calls per hour, unless you get whitelisted. You can get results for the previous 14 days. An example request is below:
http://api.twittercounter.com?twitter_id=813286&apikey=[api_key]
The results, in JSON, look like this:
{"version":"1.1","username":"BarackObama","url":"http:\/\/www.barackobama.com","avatar":"http:\/\/a1.twimg.com\/profile_images\/784227851\/BarackObama_twitter_photo_normal.jpg","followers_current":7420937,"date_updated":"2011-04-16","follow_days":"563","started_followers":"2264457","growth_since":5156480,"average_growth":"9166","tomorrow":"7430103","next_month":"7695917","followers_yesterday":7414507,"rank":"3","followers_2w_ago":7243541,"growth_since_2w":177396,"average_growth_2w":"12671","tomorrow_2w":"7433608","next_month_2w":"7801067","followersperdate":{"date2011-04-16":7420937,"date2011-04-15":7414507,"date2011-04-14":7400522,"date2011-04-13":7385729,"date2011-04-12":7370229,"date2011-04-11":7366548,"date2011-04-10":7349078,"date2011-04-09":7341737,"date2011-04-08":7325918,"date2011-04-07":7309609,"date2011-04-06":7306325,"date2011-04-05":7283591,"date2011-04-04":7269377,"date2011-04-03":7257596},"last_update":1302981230}
The retweet stats aren't available from Twittercounter, but you might be able to obtain those from Favstar (although they don't have a public API currently.)
My problem is that I also need unfollow statistics, which Twittercounter does not supply.
My solution was to access the Twitter REST API directly, using the oauth2 library in Python. I found this very simple compared to some of the other Twitter libraries for Python out there. This example was particularly helpful: http://parand.com/say/index.php/2010/06/13/using-python-oauth2-to-access-oauth-protected-resources/
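For illustration, a minimal sketch of signing a Twitter REST API request with the python-oauth2 library, roughly following the linked example; the endpoint, screen name and credential placeholders are assumptions and would need to match your registered application and the current Twitter API.

import json
import oauth2 as oauth

# Placeholder credentials obtained from your registered Twitter application
consumer = oauth.Consumer(key="CONSUMER_KEY", secret="CONSUMER_SECRET")
token = oauth.Token(key="ACCESS_TOKEN", secret="ACCESS_TOKEN_SECRET")
client = oauth.Client(consumer, token)

# Request the account's profile, which includes follower and tweet counts
url = "https://api.twitter.com/1.1/users/show.json?screen_name=BarackObama"
resp, content = client.request(url, "GET")
user = json.loads(content)
print(user["followers_count"], user["statuses_count"])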
