Historical data using Reddit API - python

I am trying to access historical data on mentions of a few keywords for a data analysis project using the Reddit API, and I am using Python's easy-to-use PRAW package to fetch the data. Does anyone know if the Reddit API has any functionality that allows historical access to data from a subreddit?

You can only get the last 1000 items for a specific view; use the Subreddit's submissions property to iterate over them. You can, however, query several different views. I describe some of the other views in a reddit comment I made:
Yes. Generally speaking, you can get the last 1000 items in a listing (/r/all and /r/popular listings are higher), regardless of how long ago they were posted.
To get more than 1000 items:
check all of the views (/r/subreddit/top, etc.) over all of the time scales
check all of the moderation queues (with parameter only=links):
unmoderated (/about/unmoderated)
moderation queue (/about/modqueue)
spam (/about/spam)
edited (/about/edited)
reports (/about/reports)
if this is a public subreddit, consider also using pushshift.io
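A minimal sketch of the "check every view" idea with PRAW (credentials are placeholders; each view is still capped at roughly 1000 items, so this widens coverage rather than guaranteeing completeness):

import praw

reddit = praw.Reddit(client_id='CLIENT_ID', client_secret='SECRET_KEY',
                     user_agent='USER_AGENT')
subreddit = reddit.subreddit('SUBREDDIT')

# Union the submission IDs seen across several views and time scales
seen = set()
for listing in (subreddit.new(limit=None), subreddit.hot(limit=None),
                subreddit.rising(limit=None)):
    seen.update(submission.id for submission in listing)
for time_filter in ('hour', 'day', 'week', 'month', 'year', 'all'):
    seen.update(s.id for s in subreddit.top(time_filter=time_filter, limit=None))
    seen.update(s.id for s in
                subreddit.controversial(time_filter=time_filter, limit=None))

print(len(seen), 'unique submissions collected')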

Related

How to use Reddit PSRAW to fetch upvotes of posts of certain sections?

Earlier I was using the Reddit API, which does provide upvote counts, but due to the limit on returned results I am forced to use the Pushshift API. It seems Pushshift does not return vote counts and also does not filter posts by section the way Reddit does.
Is there any way to achieve this? I tried the /info endpoint, https://www.reddit.com/api/info.json?id=<ID>, but it did not work either.
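One hedged workaround, sketched below: pull matching submission IDs from Pushshift, then ask Reddit for the live scores through PRAW's info() call. Note that /api/info expects fullnames (a t3_ prefix for submissions), which is a common reason the raw endpoint appears not to work; the Pushshift query parameters here are assumptions.

import praw
import requests

reddit = praw.Reddit(client_id='CLIENT_ID', client_secret='SECRET_KEY',
                     user_agent='USER_AGENT')

# Submission IDs found via Pushshift (query parameters are assumptions)
resp = requests.get('https://api.pushshift.io/reddit/search/submission/',
                    params={'q': 'KEYWORD', 'size': 100})
ids = [item['id'] for item in resp.json()['data']]

# Fetch live score data from Reddit; t3_ marks a submission fullname
for submission in reddit.info(fullnames=['t3_' + i for i in ids]):
    print(submission.id, submission.score)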

PRAW: How to get a Reddit user total number of submissions when it is greater than 1000?

I am trying to get a Reddit user's total number of submissions, but the Reddit API is limited to showing only 1000 posts.
Because of this, the following code will not work for users that have more than a thousand submissions:
import praw

reddit = praw.Reddit(client_id='CLIENT_ID',
                     client_secret='SECRET_KEY',
                     user_agent='USER_AGENT',
                     username='USERNAME',
                     password='PASSWORD')

counter = 0
submissions = reddit.redditor('REDDIT_USERNAME').submissions.new(limit=None)
for submission in submissions:
    counter += 1

print(counter)
Likewise, I have tried simply doing print(len(submissions)), but I get the following:
TypeError: object of type 'ListingGenerator' has no len()
Is there any way to get a user's total number of submissions if they have more than 1000 posts?
Thanks in advance!
This is not possible with PRAW or any other Reddit API client. Reddit's API limits listings to about 1000 items, and no wrapper can get more than that from the Reddit API itself.
However, third party services like PushShift have Reddit data and APIs to get more than 1000 posts by a user, with the caveat that the items must be public.
There are a few approaches, and some are more complete than others.
You can visit each sort (hot, new, top, etc.) over all of the time periods (day, week, month, year, all); depending on how active the user is, this may be enough. You can also use the Pushshift API to get public submissions; a sketch follows the quote below.
I explained this in a comment I made on the redditdev subreddit:
Yes. Generally speaking, you can get the last 1000 items in a listing (/r/all and /r/popular listings are higher), regardless of how long ago they were posted.
To get more than 1000 items:
[...]
if this is a public subreddit, consider also using pushshift.io
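A minimal sketch of the Pushshift route (Pushshift is a third-party archive, so availability and coverage are not guaranteed; the metadata parameters below reflect its historical behavior and should be treated as assumptions):

import requests

# Ask Pushshift for the total submission count for an author without
# downloading the submissions themselves (size=0, metadata=true)
resp = requests.get(
    'https://api.pushshift.io/reddit/search/submission/',
    params={'author': 'REDDIT_USERNAME', 'metadata': 'true', 'size': 0})
resp.raise_for_status()
print(resp.json()['metadata']['total_results'])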

How to page through results using Shopify Python API wrapper

I want to page through the results from the Shopify API using the Python wrapper. The API recently (2019-07) switched to "cursor-based pagination", so I cannot just pass a "page" query parameter to get the next set of results.
The Shopify API docs have a page dedicated to cursor-based pagination.
The API response supposedly includes a link in the response headers that includes info for making another request, but I cannot figure out how to access it. As far as I can tell, the response from the wrapper is a standard Python list that has no headers.
I think I could make this work without using the Python API wrapper, but there must be an easy way to get the next set of results.
import shopify

shopify.ShopifyResource.set_site("https://example-store.myshopify.com/admin/api/2019-07")
shopify.ShopifyResource.set_user(API_KEY)
shopify.ShopifyResource.set_password(PASSWORD)

products = shopify.Product.find(limit=5)

# This works fine
for product in products:
    print(product.title)

# None of these work for accessing the headers referenced in the docs
print(products.headers)
print(products.link)
print(products['headers'])
print(products['link'])

# This throws an error saying that "page" is not an acceptable parameter
products = shopify.Product.find(limit=5, page=2)
Can anyone provide an example of how to get the next page of results using the wrapper?
As mentioned by @babis21, this was a bug in the shopify Python API wrapper. The library was updated in January 2020 to fix it.
For anyone stumbling upon this, here is an easy way to page through all results. The same pattern works for other API objects like Products as well.
orders = shopify.Order.find(since_id=0, status='any', limit=250)
for order in orders:
    ...  # Do something with the order

while orders.has_next_page():
    orders = orders.next_page()
    for order in orders:
        ...  # Do something with the remaining orders
Using since_id=0 will fetch ALL orders because order IDs are guaranteed to be greater than 0.
If you don't want to repeat the code that processes the order objects, you can wrap it all in an iterator like this:
def iter_all_orders(status='any', limit=250):
    orders = shopify.Order.find(since_id=0, status=status, limit=limit)
    for order in orders:
        yield order

    while orders.has_next_page():
        orders = orders.next_page()
        for order in orders:
            yield order

for order in iter_all_orders():
    ...  # Do something with each order
If you are fetching a large number of orders or other objects (for offline analysis, as I was), you will find this slow compared to the alternatives. The GraphQL API is faster than the REST API, but performing bulk operations with the GraphQL API was by far the most efficient.
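As a hedged illustration of the bulk-operation route: the mutation shape follows Shopify's Admin GraphQL API docs, but treat the polling interval and result handling as assumptions rather than a definitive implementation.

import json
import time
import shopify

# Kick off an asynchronous bulk query; Shopify writes the result to a JSONL file
bulk_mutation = '''
mutation {
  bulkOperationRunQuery(
    query: """
    { orders { edges { node { id createdAt } } } }
    """
  ) {
    bulkOperation { id status }
    userErrors { field message }
  }
}
'''
shopify.GraphQL().execute(bulk_mutation)

# Poll until the operation completes, then download the JSONL from `url`
while True:
    result = json.loads(shopify.GraphQL().execute(
        '{ currentBulkOperation { status url } }'))
    op = result['data']['currentBulkOperation']
    if op['status'] == 'COMPLETED':
        print('Download results from:', op['url'])
        break
    time.sleep(5)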
You can get the response headers with the code below:
resp_header = shopify.ShopifyResource.connection.response.headers["link"]
Then split the link header string on ',', strip the angle brackets, and you have the next-page URL.
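A minimal sketch of that parsing; the header format follows Shopify's cursor-based pagination docs, and parse_link_header is an illustrative name, not part of the library:

def parse_link_header(link_header):
    # Shopify sends e.g.: <https://...page_info=abc>; rel="previous",
    #                     <https://...page_info=xyz>; rel="next"
    links = {}
    for part in link_header.split(','):
        url_part, _, rel_part = part.partition(';')
        url = url_part.strip().lstrip('<').rstrip('>')
        rel = rel_part.split('=', 1)[1].strip().strip('"')
        links[rel] = url
    return links

next_url = parse_link_header(resp_header).get('next')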
I am not familiar with Python, but I think this will work. You can also review these links:
https://community.shopify.com/c/Shopify-APIs-SDKs/Python-API-library-for-shopify/td-p/529523
https://community.shopify.com/c/Shopify-APIs-SDKs/Trouble-with-pagination-when-fetching-products-API-python/td-p/536910
Thanks.
@rseabrook
I have exactly the same issue; it seems others do as well, and someone has raised it: https://github.com/Shopify/shopify_python_api/issues/337
There is an open PR for it: https://github.com/Shopify/shopify_python_api/pull/338
I guess it should be ready soon, so an alternative would be to wait a bit and use the 2019-04 API version (which supports the page parameter for pagination).
UPDATE: It seems this has been released now: https://github.com/Shopify/shopify_python_api/pull/352

How to build a real time recommendation engine with good performance?

I am a data analyst and have just been assigned to build a real-time recommendation engine for our website.
I need to analyse visitor behavior and do real-time analysis of that input. So I have three questions about this project:
1) Users are not forced to sign up. Is there any methodology to capture user behavior such as search and visit history?
2) The recommendation models can be pre-trained, but the prediction process takes time. How can we improve its performance?
3) I only know how to write Python scripts. How can I implement the recommendation engine with my Python scripts?
Thanks.
===============
However, 90% of our customers purchase products during their first visit and will not return soon.
So we cannot have a model ready in advance for new visitors.
And they prefer to use item-based collaborative filtering (itemCF) for the recommendation engine.
It sounds like mission impossible now...
This is quite a broad question; however, I will do my best to answer:
Visit history can be tracked by enabling some form of analytics tracking on your domain. This can be a pre-built solution that you implement, which will give you a detailed overview of all visitors to your domain, usually with some form of dashboard. Most pre-built solutions also provide a way to export the analytics they have collected.
Another way would be to use browser cookies to store information pertaining to each visit to your domain and the user's search/page history. This information is available to the website whenever the user visits it within the same browser. When the user visits your website, you can send this information (IP, geolocation, number of visits, search/page history) to a server/REST endpoint, which can analyse it and make recommendations based on it. Another common method is to track past purchases, etc.
To improve performance, one solution would be to always have the prediction model for a particular user ready for when they next visit the site. That way, there is no delay. However, the first time a user visits, you likely won't have enough information to make detailed predictions, so you will have to fall back on options based on geolocation (which shouldn't take too long and won't impact performance).
There is another approach worth considering: the above mainly covers making predictions based on a user's behavior while browsing the website. Content-based filtering instead recommends things that are similar to an item the user is currently viewing. This approach is generally easier, as it just requires querying a database for items that are similar in category, purpose/use, etc.
There is no getting around using JavaScript for the client-side work, but your recommendation engine can be built in Python; it could be a simple REST API endpoint with access to the items database, as in the sketch below. Most people use Flask, Django, or Eve to implement REST APIs in Python.
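A minimal sketch of such an endpoint, assuming Flask and the content-based approach above (the catalog.db database and its items table are hypothetical, introduced just for illustration):

import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/recommendations/<int:item_id>')
def recommendations(item_id):
    # Hypothetical catalog: items(id, name, category)
    conn = sqlite3.connect('catalog.db')
    row = conn.execute('SELECT category FROM items WHERE id = ?',
                       (item_id,)).fetchone()
    if row is None:
        conn.close()
        return jsonify([]), 404
    # Content-based filtering at its simplest: same category, different item
    similar = conn.execute(
        'SELECT id, name FROM items WHERE category = ? AND id != ? LIMIT 10',
        (row[0], item_id)).fetchall()
    conn.close()
    return jsonify([{'id': i, 'name': n} for i, n in similar])

if __name__ == '__main__':
    app.run()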

Surveymonkey API: check if a specific email has completed survey

I have a large number of users (over 400k) who have been sent a survey to complete. As part of logging into my site, I use the SurveyMonkey API to check whether they completed their assigned survey, keying on email address. I'm thinking of using:
https://developer.surveymonkey.com/mashery/get_respondent_list
However, I don't want to page through all 400k users to find a specific email. Is there a way to do this search more efficiently?
I am using a Django backend to call the SurveyMonkey API.
get_respondent_list allows you to search for respondents by a modified date/time range. With 400K respondents, you should store the results in a local database and only query the API when the email address you're looking for isn't found locally.
To avoid having to parse the whole list every time, fetch only the respondents added since the last time you checked (using that date/time-range feature) and add them to your DB; a sketch of this pattern follows the link below. There is example code illustrating polling for new respondents by date/time range in SurveyMonkey's public GitHub here:
https://github.com/SurveyMonkey/python_guides/blob/master/guides/polling.py
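A hedged sketch of that local-cache-plus-polling pattern; the fetch_new_respondents helper is hypothetical and stands in for the get_respondent_list polling code linked above:

import sqlite3

conn = sqlite3.connect('respondents.db')
conn.execute('CREATE TABLE IF NOT EXISTS respondents '
             '(email TEXT PRIMARY KEY, completed INTEGER)')

def fetch_new_respondents(since):
    # Hypothetical wrapper around get_respondent_list, filtered by a
    # date-modified range; fill in with the polling example linked above.
    raise NotImplementedError

def has_completed(email, last_polled_at):
    # Check the local cache first; only hit the API on a miss.
    row = conn.execute('SELECT completed FROM respondents WHERE email = ?',
                       (email,)).fetchone()
    if row is None:
        for r in fetch_new_respondents(last_polled_at):
            conn.execute('INSERT OR REPLACE INTO respondents VALUES (?, ?)',
                         (r['email'], r['completed']))
        conn.commit()
        row = conn.execute('SELECT completed FROM respondents WHERE email = ?',
                           (email,)).fetchone()
    return bool(row[0]) if row else False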
