Summary: I want to scrape a subreddit and then turn the data into dataframes. I know how to do each step individually, but I am stuck on wrapping them in a function.
Here is how I do it one by one.
import requests
import pandas as pd

url = 'https://api.pushshift.io/reddit/search/submission'
params3 = {'subreddit': 'Apple', 'size': 500, 'before': 1579411194}
res3 = requests.get(url, params3)
data = res3.json()
post3 = data['data']
apdf3 = pd.DataFrame(post3)
Here is the function I came up with so far:
url = 'https://api.pushshift.io/reddit/search/submission'
def webscrape(subreddit, size):
    for i in range(1, 11):
        params = {"subreddit": subreddit, 'size': size, 'before': f'post{i}'[-1]['created_utc']}
        res = requests.get(url, params)
        f'data{i}' = res.json()
        f'post{i}' = data[f'data{i}']
        f'ap_df{i}' = pd.DataFrame(f'post{i}')
My problem is that my first request doesn't need 'before', but every request after that needs 'before' set to the timestamp of the last post from the previous request, so that I get all the posts earlier than that one. How do I reconcile this conflict?
Many thanks!
What you are asking for is doable, but f-strings won't work here: an f-string cannot be the target of an assignment. The code below instead attaches each dataframe to a dictionary of dataframes. Try it and see if it works:
d = {}
url = 'https://api.pushshift.io/reddit/search/submission'
def webscraper(subreddit, size):
    bef = 0
    for i in range(1, 11):
        if i == 1:
            # first request: no 'before' filter
            params = {"subreddit": subreddit, 'size': size}
        else:
            # later requests: only posts older than the last one seen
            params = {"subreddit": subreddit, 'size': size, 'before': bef}
        res = requests.get(url, params)
        data = res.json()
        dat = data['data']
        bef = dat[-1]['created_utc']  # timestamp of the oldest post in this batch
        df_name = subreddit + str(i)
        d[df_name] = pd.DataFrame(dat)
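For example, you could call it and then combine the collected frames (a minimal usage sketch; the pd.concat step is my addition, not part of the original answer):

webscraper('Apple', 500)
# d now holds ten dataframes keyed 'Apple1' ... 'Apple10';
# they can be combined into a single frame if needed:
all_apple = pd.concat(d.values(), ignore_index=True)
print(all_apple.shape)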
My code sends a GET request with query parameters that depend on a page number.
I then loop over the response to collect some IDs, and I also read the next page number out of that same response.
Then I send a new GET request with that next page number and collect the IDs from the new response as well.
My code works fine, but I'm using two loops, which doesn't feel like the right way. I couldn't do it with one loop. Any ideas?
def get():
    response = requests.get(url, headers=header)
    data = json.loads(response.text)
    check_if_theres_next_page = data['pagination']['hasMorePages']
    check_for_next_page_number = data['pagination']['nextPage']
    last_page_number = data['pagination']['lastPage']
    orders = data['orders']
    list_of_ids = []
    for manufacturingOrderId in orders:
        ids = manufacturingOrderId['manufacturingOrderId']
        list_of_ids.append(ids)
    if check_for_next_page_number == 4:
        check_for_next_page_number = last_page_number
    # keep fetching while the API reports more pages
    while check_if_theres_next_page:
        url_ = url + '&page_number=' + str(check_for_next_page_number)
        response = requests.get(url_, headers=header)
        data = json.loads(response.text)
        orders = data['orders']
        for manufacturingOrderId_ in orders:
            ids = manufacturingOrderId_['manufacturingOrderId']
            list_of_ids.append(ids)
        if "nextPage" in data['pagination']:
            check_for_next_page_number = data['pagination']['nextPage']
        else:
            check_if_theres_next_page = False
    return list_of_ids
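For what it's worth, here is a hedged sketch of how the same pagination could collapse into a single loop, assuming the same url and header globals and the same response shape as in the question (the helper name get_ids_single_loop is mine):

import requests

def get_ids_single_loop():
    # a sketch, not the asker's code: fetch page after page until the
    # payload reports no more pages, appending the IDs as we go
    list_of_ids = []
    next_url = url  # the first request uses the bare url
    while True:
        data = requests.get(next_url, headers=header).json()
        for order in data['orders']:
            list_of_ids.append(order['manufacturingOrderId'])
        pagination = data['pagination']
        if not pagination.get('hasMorePages') or 'nextPage' not in pagination:
            break
        next_url = url + '&page_number=' + str(pagination['nextPage'])
    return list_of_ids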
I would like to retrieve all records (total 50,000) from an API endpoint. The endpoint only returns a maximum of 1000 records per page. Here's the function to get the records.
def get_products(token, page_number):
    url = "https://example.com/manager/nexus?page={}&limit={}".format(page_number, 1000)
    header = {
        "Authorization": "Bearer {}".format(token)
    }
    response = requests.get(url, headers=header)
    product_results = response.json()
    total_list = []
    for result in product_results['Data']:
        date = result['date']
        price = result['price']
        name = result['name']
        total_list.append((date, price, name))
    columns = ['date', 'price', 'name']
    df = pd.DataFrame(total_list, columns=columns)
    results = json.dumps(total_list)
    return df, results
How can I loop through each page until the final record without hardcoding the page numbers? Currently, I'm hardcoding the page numbers as below for the first 2 pages to get 2000 records as a test.
for page_number in np.arange(1, 3):
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    if page_number == 1:
        product_all = product_df
    else:
        product_all = pd.concat([product_all, product_df])
print(product_all)
Thank you.
I don't know the exact behavior of the endpoint, but assuming that requesting a page number greater than the last page returns an empty list, you can simply check whether the result is empty:
page_number = 1
token = get_token()
product_df, product_json = get_products(token, page_number)
product_all = product_df
while product_df.size:
    page_number = page_number + 1
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    product_all = pd.concat([product_all, product_df])
print(product_all)
If you are sure each page holds at most 1000 records, you could instead stop the loop as soon as a page comes back with fewer than 1000 rows, which avoids the final empty request.
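A minimal sketch of that variant, assuming the same get_token and get_products helpers as above:

page_number = 1
frames = []
while True:
    token = get_token()
    product_df, product_json = get_products(token, page_number)
    frames.append(product_df)
    if len(product_df) < 1000:  # a short page means this was the last one
        break
    page_number += 1
product_all = pd.concat(frames, ignore_index=True)
print(product_all)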
It depends on how your backend GET method returns its JSON. Since page and limit are required parameters, you could rewrite the endpoint to return all the data instead of just 1000 records at a time. Otherwise, since the total is known, you can compute the number of pages up front:
num = int(50000 / 1000)  # 50 pages in total
for i in range(1, num + 1):  # range(1, num) would miss the last page
    token = get_token()
    product_df, product_json = get_products(token, i)
    if i == 1:
        product_all = product_df
    else:
        product_all = pd.concat([product_all, product_df])
print(product_all)
When I try to display the number of cases and the name of the country for all countries, I get an error. I have tried many ways without success.
Here is the code:
url = "https://coronavirus-19-api.herokuapp.com/countries"
response = requests.get(url).json()
j = range(0, 10)
all = {
    'country': response[i]['country'],
    'confirmed': response[i]['cases'],
}
for i in j:
    for i in response:
        print(all)
import requests
url = "https://coronavirus-19-api.herokuapp.com/countries"
response = requests.get(url).json()
data = []
for line in response:
    tmp = {}
    tmp['country'] = line['country']
    tmp['confirmed'] = line['cases']
    data.append(tmp)
print(data)
I found a solution, but you have to know the number of countries in the API.
Here it is:
url = "https://coronavirus-19-api.herokuapp.com/countries"
response = requests.get(url).json()
i = 0
for i in range(0, 189):
    all = {
        'country': response[i]['country'],
        'confirmed': response[i]['cases'],
    }
    print(all)
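If you want to avoid hardcoding the count, len(response) gives the same range without knowing the number of countries up front (a small variant of the code above, not from the original post):

for i in range(len(response)):
    print({'country': response[i]['country'], 'confirmed': response[i]['cases']})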
I've got a list of IDs which I want to pass through the URLs to collect the data on the comments. But I'm kind of a newbie, and when I try to iterate over the list I get only one URL and consequently data for one comment. Can someone please explain what's wrong with my code and how to get URLs for all the IDs in the list, and so collect the data for all comments?
comments_from_reddit = ['fkkmga7', 'fkkgxtj', 'fkklfx3', ...]
def getPushshiftData():
    for ID in range(len(comments_from_reddit)):
        url = 'https://api.pushshift.io/reddit/comment/search?ids={}'.format(comments_from_reddit[ID])
        print(url)
        req = requests.get(url)
        data = json.loads(req.text)
        return data['data']
data = getPushshiftData()
Output I'm getting: https://api.pushshift.io/reddit/comment/search?ids=fkkmga7
I will really appreciate any help on my issue. Thanks for your attention.
This should work. The return inside your loop exits the function on the first iteration, so only one URL is ever fetched; collect each result in a list instead and return it after the loop:
comments_from_reddit = ['fkkmga7', 'fkkgxtj', 'fkklfx3', ...]
def getPushshiftData():
    result = list()
    for ID in range(len(comments_from_reddit)):
        url = 'https://api.pushshift.io/reddit/comment/search?ids={}'.format(comments_from_reddit[ID])
        print(url)
        req = requests.get(url)
        data = json.loads(req.text)
        result.append(data['data'])
    return result
data = getPushshiftData()
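As an aside (my suggestion, not part of the original answer): if the endpoint accepts a comma-separated list in the ids parameter, as Pushshift's comment search historically did, all the comments could be fetched in a single request:

url = 'https://api.pushshift.io/reddit/comment/search?ids={}'.format(','.join(comments_from_reddit))
data = json.loads(requests.get(url).text)['data']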
I am trying to fetch deals data from HubSpot. To simplify the question I am fetching only the deal id and deal name in this example, but later I will add more properties. I have the following code, which gives me one array of dealIds and one array of deal names. How could I make it so that instead of multiple arrays I get the following:
{{12345,'deal1'}, {12346,'deal2'}, {12347,'deal3'}}
or something like:
{{'dealId': 12345, 'dealname' : 'deal1'}}
This is my code so far:
deals = []
names = []
def getdeals():
    apikey = "demo"
    url = 'https://api.hubapi.com/deals/v1/deal/paged?hapikey=' + apikey + '&properties=dealname&limit=250'
    response = requests.get(url)
    jsonDeals = response.json()
    for deal in jsonDeals['deals']:
        properties = deal['properties']
        deals.append(deal['dealId'])
        names.append(properties['dealname']['value'])
You already have the data in JSON; it's just a matter of how you want to map and store it.
output={}
def getdeals():
    apikey = "demo"
    url = 'https://api.hubapi.com/deals/v1/deal/paged?hapikey=' + apikey + '&properties=dealname&limit=250'
    response = requests.get(url)
    jsonDeals = response.json()
    for deal in jsonDeals['deals']:
        properties = deal['properties']
        output.update({deal['dealId']: properties['dealname']['value']})
This can be solved with a list comprehension:
[{'dealId': deal['dealId'], 'dealname': deal['properties']['dealname']['value']} for deal in jsonDeals['deals']]
As E.Serra suggested, deal_obj = {'dealname': properties['dealname']['value'], 'dealid': deal['dealId']} solved the issue.
Here is the updated code:
%%time
deals = []
def getdeals():
    apikey = "demo"
    url = 'https://api.hubapi.com/deals/v1/deal/paged?hapikey=' + apikey + '&properties=dealname&limit=250'
    response = requests.get(url)
    jsonDeals = response.json()
    for deal in jsonDeals['deals']:
        properties = deal['properties']
        deal_obj = {'dealname': properties['dealname']['value'], 'dealid': deal['dealId']}
        deals.append(deal_obj)
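Since you plan to add more properties later, note that a list of dicts like deals converts straight into a dataframe (my addition, assuming pandas is available; not part of the original answer):

import pandas as pd

getdeals()
deals_df = pd.DataFrame(deals)  # columns: 'dealname', 'dealid'
print(deals_df.head())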