I am trying to get unique values from all the lists titled "DataClasses" in an API response. With my code it only prints the first list titled "DataClasses", but there are at least 40 other lists with that name.
I am confused about how to get all of the lists titled "DataClasses" when I don't know how many times they will appear in an API response.
import requests
import json
import time
import re

with open('dataclasstest2.txt', 'r') as f:
    emails = [line.strip() for line in f]

def main():
    for name in emails:
        url = 'https://haveibeenpwned.com/api/v2/breachedaccount/' + name
        headers = {
            'User-Agent': 'dataclasses.py'
            'Accept: application/vnd.haveibeenpwned.v2+json'
        }
        r = requests.get(url, headers=headers)
        if r.status_code == 200:
            data = r.json()[0].get('DataClasses')
            print(name + ":" + ', '.join(data))
            time.sleep(2)
        else:
            time.sleep(2)

if __name__ == "__main__":
    main()
Here is the output I am getting:
test#example.com:Email addresses, IP addresses, Names, Passwords
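For reference, here is a minimal sketch of collecting the unique values across every breach rather than only the first one. It assumes the v2 breachedaccount response is a JSON array of breach objects, each carrying its own 'DataClasses' list, so it simply iterates the whole array and accumulates the values in a set:

import requests

def unique_data_classes(name):
    # Sketch only: iterate every breach object in the response,
    # not just r.json()[0], and merge the 'DataClasses' lists into a set.
    url = 'https://haveibeenpwned.com/api/v2/breachedaccount/' + name
    headers = {
        'User-Agent': 'dataclasses.py',
        'Accept': 'application/vnd.haveibeenpwned.v2+json',
    }
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        return set()
    classes = set()
    for breach in r.json():
        classes.update(breach.get('DataClasses', []))
    return classes

# Example usage (the address is a placeholder):
# print(', '.join(sorted(unique_data_classes('test@example.com'))))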
Related
I am currently working on a project with the goal of determining the popularity of various topics on gis.stackexchange. I am using Python to interface with the Stack Exchange API. My issue is that I am having trouble configuring the API query to match what a basic search using the search bar would return (posts containing the term x). I am currently using the /search/advanced endpoint with q="term", but I am getting empty results for search terms that should have around 100-200 posts. I have read a lot of the API documentation but can't seem to configure the API query to match what a site search would yield.
Edit: For example, if I search "Bayesian", I get 42 results on gis.stackexchange, but when I set q=Bayesian in the API request I get an empty return.
I have included my program below if it helps. Thanks!
#Interfacing_with_SO_API
import requests as rq
import json
import time

keywordinput = input('Enter your search term. If two words separate by - : ')

baseurl = ('https://api.stackexchange.com/2.3/search/advanced?page=')
endurl = ('&pagesize=100&order=desc&sort=votes&q=' + keywordinput + '&site=gis.stackexchange&filter=!-nt6H9O0imT9xRAnV1gwrp1ZOq7FBaU7CRaGpVkODaQgDIfSY8tJXb')
urltot = ('https://api.stackexchange.com/2.3/search/advanced?page=1&pagesize=100&order=desc&sort=votes&q=' + keywordinput + '&site=gis.stackexchange&filter=!-nt6H9O0imT9xRAnV1gwrp1ZOq7FBaU7CRaGpVkODaQgDIfSY8tJXb')

response = rq.get(urltot)
page = range(1, 400)

if response.status_code == 400:
    print('Initial Response Code 400: Stopping')
    exit()
elif response.status_code == 200:
    print('Initial Response Code 200: Continuing')

datarr = []
for n in page:
    response = rq.get(baseurl + str(n) + endurl)
    print(baseurl + str(n) + endurl)
    time.sleep(2)
    if response.status_code == 400 or response.json()['has_more'] == False or n > 400:
        print('No more pages')
        break
    elif response.json()['has_more'] == True:
        for data in response.json()['items']:
            if data['view_count'] >= 0:
                datarr.append(data)
                print(data['view_count'])
                print(data['answer_count'])
                print(data['score'])

#convert datarr to csv and save to file
with open(input('Search Term Name (filename): ') + '.csv', 'w') as f:
    for data in datarr:
        f.write(str(data['view_count']) + ',' + str(data['answer_count']) + ',' + str(data['score']) + '\n')
exit()
If you look at the results for searching bayesian on the GIS StackExchange site, you'll get 42 results because the StackExchange site search returns both questions and answers that contain the term.
However, the standard /search and /search/advanced API endpoints only search questions, per the doc (emphasis mine):
Searches a site for any *questions* which fit the given criteria
Instead, what you want to use is the /search/excerpts endpoint, which will return both questions and answers.
Quick demo in the shell to show that it returns the same number of items:
curl -s --compressed "https://api.stackexchange.com/2.3/search/excerpts?page=1&pagesize=100&site=gis&q=bayesian" | jq '.["items"] | length'
42
And a minimal Python program to do the same:
#!/usr/bin/env python3
# file: test_so_search.py
import requests

if __name__ == "__main__":
    api_url = "https://api.stackexchange.com/2.3/search/excerpts"
    search_term = "bayesian"
    qs = {
        "page": 1,
        "pagesize": 100,
        "order": "desc",
        "sort": "votes",
        "site": "gis",
        "q": search_term
    }
    rsp = requests.get(api_url, qs)
    data = rsp.json()
    print(f"Got {len(data['items'])} results for '{search_term}'")
And output:
> python test_so_search.py
Got 42 results for 'bayesian'
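If you need more than one page of excerpts, the response wrapper for /search/excerpts carries the same has_more flag your original script checks, so the paging loop carries over. A minimal sketch, reusing the query parameters from the program above and sleeping between calls:

import time
import requests

api_url = "https://api.stackexchange.com/2.3/search/excerpts"
items = []
page = 1
while True:
    qs = {"page": page, "pagesize": 100, "order": "desc", "sort": "votes",
          "site": "gis", "q": "bayesian"}
    data = requests.get(api_url, qs).json()
    items.extend(data.get("items", []))
    if not data.get("has_more"):
        break
    page += 1
    time.sleep(2)  # simple politeness delay; honor any 'backoff' field the API returns

print(f"Collected {len(items)} items in total")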
I have a data set of tweets retrieved via the Twitter streaming API.
However, I regularly want to be updated on how their public metrics change, so I wrote code to request those public metrics:
import json

import pandas as pd
import requests

def create_url():
    tweet_fields = "tweet.fields=public_metrics"
    tweets_data_path = 'dataset.txt'
    tweets_data = []
    tweets_file = open(tweets_data_path, "r")
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
        except:
            continue
    df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')
    df_id = (str(str((df['id'].tolist()))[1:-1])).replace(" ", "")
    ids = "ids=" + df_id
    url = "https://api.twitter.com/2/tweets?{}&{}".format(ids, tweet_fields)
    return url

def bearer_oauth(r):
    r.headers["Authorization"] = f"Bearer {'AAAAAAAAAAAAAAAAAAAAAN%2B7QwEAAAAAEG%2BzRZkmZ4HGizsKCG3MkwlaRzY%3DOwuZeaeHbeMM1JDIafd5riA1QdkDabPiELFsguR4Zba9ywzzOQ'}"
    r.headers["User-Agent"] = "v2TweetLookupPython"
    return r

def connect_to_endpoint(url):
    response = requests.request("GET", url, auth=bearer_oauth)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Request returned an error: {} {}".format(
                response.status_code, response.text
            )
        )
    return response.json()

def main():
    url = create_url()
    json_response = connect_to_endpoint(url)
    print(json.dumps(json_response, indent=3, sort_keys=True))

if __name__ == "__main__":
    main()
Unfortunately, my data set has more than 100 IDs in it and I want to retrieve the metrics for all of them. As I can only request 100 IDs at a time, can you help me with how to do that?
I would also like to make the request daily at midnight and then store the result in a txt file; maybe you can help me with that as well?
You can chunk your data and send it in batches using itertools.islice.
test.py:
import reprlib
from itertools import islice

import pandas as pd

BASE_URL = "https://api.twitter.com/2/tweets"
CHUNK = 100

def req(ids):
    tmp = reprlib.repr(ids)  # Used here just to shorten the output
    print(f"{BASE_URL}?ids={tmp}")

def main():
    df = pd.DataFrame({"id": range(1000)})
    it = iter(df["id"])
    while chunk := tuple(islice(it, CHUNK)):
        ids = ",".join(map(str, chunk))
        req(ids)

if __name__ == "__main__":
    main()
Test:
$ python test.py
https://api.twitter.com/2/tweets?ids='0,1,2,3,4,5,...5,96,97,98,99'
https://api.twitter.com/2/tweets?ids='100,101,102,...6,197,198,199'
https://api.twitter.com/2/tweets?ids='200,201,202,...6,297,298,299'
https://api.twitter.com/2/tweets?ids='300,301,302,...6,397,398,399'
https://api.twitter.com/2/tweets?ids='400,401,402,...6,497,498,499'
https://api.twitter.com/2/tweets?ids='500,501,502,...6,597,598,599'
https://api.twitter.com/2/tweets?ids='600,601,602,...6,697,698,699'
https://api.twitter.com/2/tweets?ids='700,701,702,...6,797,798,799'
https://api.twitter.com/2/tweets?ids='800,801,802,...6,897,898,899'
https://api.twitter.com/2/tweets?ids='900,901,902,...6,997,998,999'
Note: You'll make multiple requests with this approach so keep in mind any rate limits.
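If you want to go from printing the URLs to actually fetching the metrics, one option (a rough sketch, not part of the original answer; the function names here are made up) is to pass each chunk to the tweets lookup endpoint and reuse the bearer_oauth callable from the question for authentication:

import json
from itertools import islice

import requests

BASE_URL = "https://api.twitter.com/2/tweets"
CHUNK = 100

def fetch_chunk(ids, auth):
    # One lookup of up to 100 comma-separated tweet IDs with their public metrics.
    params = {"ids": ids, "tweet.fields": "public_metrics"}
    rsp = requests.get(BASE_URL, params=params, auth=auth)
    rsp.raise_for_status()
    return rsp.json().get("data", [])

def fetch_all(id_iterable, auth):
    # Chunk the IDs exactly as above and collect the returned tweet objects.
    it = iter(id_iterable)
    results = []
    while chunk := tuple(islice(it, CHUNK)):
        results.extend(fetch_chunk(",".join(map(str, chunk)), auth))
    return results

Called as fetch_all(df["id"], bearer_oauth), this returns one list of tweet objects that you can append to a text file, e.g. one JSON line per tweet. For the daily run at midnight, a cron entry such as 0 0 * * * python3 /path/to/script.py is the usual approach on Linux.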
I want to check if a website exists, given a list of websites in the format XXXXX.com, where XXXXX is a 5-digit number. So I want to go through from 00000 up to 99999 and see which of those variants of the website exist.
I want to do something like
import requests

request = requests.get('http://www.example.com')
if request.status_code == 200:
    print('Web site exists')
else:
    print('Web site does not exist')
But I want to generate a list of some sort (or even just export a list to CSV), so that for each URL I know whether it exists or not.
Any advice would be great!
I'm going to assume that you have a large list of URLs and want to read them in from some source file, let's say a text file, rather than hard-coding a large list of URLs in a Python file. If that's the case, run the script below and you'll get what you want.
import urllib.request
import urllib.error
import time
from multiprocessing import Pool

start = time.time()

file = open('C:\\your_path\\check_me.txt', 'r', encoding="ISO-8859-1")
urls = file.readlines()
print(urls)

def checkurl(url):
    try:
        conn = urllib.request.urlopen(url)
    except urllib.error.HTTPError as e:
        # Return code error (e.g. 404, 501, ...)
        print('HTTPError: {}'.format(e.code) + ', ' + url)
    except urllib.error.URLError as e:
        # Not an HTTP-specific error (e.g. connection refused)
        print('URLError: {}'.format(e.reason) + ', ' + url)
    else:
        # 200
        print('good' + ', ' + url)

if __name__ == "__main__":
    p = Pool(processes=20)
    result = p.map(checkurl, urls)
    print("done in : ", time.time() - start)
Try combining range (xrange on Python 2) and the string zfill method in a loop.
import requests

def test_for_200(url):
    req = requests.get(url)
    return req.status_code == 200

def numbers():
    for n in range(100000):  # xrange on Python 2
        yield str(n).zfill(5)

results = {}
for num in numbers():
    url = "http://{}.com".format(num)
    results[num] = test_for_200(url)
results will look something like this:
>>> results
{'00000': True, '00001': False, ...}
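To get the CSV export mentioned in the question, the results dict can be written out once the loop finishes; a minimal sketch (the filename is just an example):

import csv

# One row per URL, e.g. "00000,True"
with open('url_checks.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['number', 'exists'])
    for num in sorted(results):
        writer.writerow([num, results[num]])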
I'm trying to loop over an API call that returns JSON, since each call is limited to 200 rows. When I try the code below, the loop doesn't seem to end even when I leave it running for an hour or so. The maximum I'm looking to pull is about 200k rows from the API.
bookmark = ''
urlbase = 'https://..../?'
alldata = []
while True:
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    if len(data['rows']) < 200:
        break
Also, I'm looking to filter the loop so it only outputs rows where the JSON value 'pet.type' is "Puppies" or "Kittens". I haven't been able to figure out the syntax.
Any ideas?
Thanks
The break condition for your loop is incorrect. Notice it's checking len(data["rows"]), where data only includes rows from the most recent request.
Instead, you should be looking at the total number of rows you've collected so far: len(alldata).
bookmark = ''
urlbase = 'https://..../?'
alldata = []
while True:
    url = urlbase                  # first request carries no bookmark yet
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    response = requests.get(url, auth=('username', 'password'))  # keep the response
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    # Check `alldata` instead of `data["rows"]`,
    # and set the limit to 200k instead of 200.
    if len(alldata) >= 200000:
        break
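For the filtering part of the question, the syntax depends on how 'pet.type' is nested inside each row. Assuming each row is a dict shaped like {"pet": {"type": "Puppies", ...}, ...} (an assumption; adjust the lookup if it is a flat "pet.type" key instead), a simple comprehension over the collected rows works:

# Keep only rows whose pet type is one of the wanted values.
wanted = {"Puppies", "Kittens"}
filtered = [
    row for row in alldata
    if row.get("pet", {}).get("type") in wanted
]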
I would like to modify the code below to allow for searching multiple stores at once (via the four-digit store number in the 'data' section below). What is the best way to accomplish this? Preferably I would be able to limit the search to 50-100 stores.
import requests
import json

js = requests.post("http://www.walmart.com/store/ajax/search",
                   data={"searchQuery": "store=2516&size=18&dept=4044&query=43888060"}).json()
data = json.loads(js['searchResults'])
res = data["results"][0]
print(res["name"], res["inventory"])
I would also like the store # printed in the line above.
The data value in your requests.post call can be constructed like any other string, and then a variable that represents it can take the place of your "store=2516..." string. Like this, assuming requests is defined in the outer function someplace and can be reused:
var stores = ["2516", "3498", "5478"];
stores.forEach( makeTheCall );

function makeTheCall( element, index, array ) {
    storeQuery = "store=" + element + "&size=18&dept=4044&query=43888060";
    js = requests.post("http://www.walmart.com/store/ajax/search",
                       data={"searchQuery": storeQuery}).json()
    data = json.loads(js['searchResults'])
    res = data["results"][0]
    console.log("name: " + res["name"] + ", store: " + element + ", inventory: " + res["inventory"]);
}
I'm not familiar with your use of "print", but I've only ever used client-side JavaScript.
The API does not support searching for multiple stores, so you need to make multiple requests.
import requests
import json
from collections import defaultdict

results = defaultdict(list)
stores = [2516, 1234, 5678]
url = "http://www.walmart.com/store/ajax/search"
query = "store={}&size=18&dept=4044&query=43888060"

for store in stores:
    r = requests.post(url, data={'searchQuery': query.format(store)})
    r.raise_for_status()
    try:
        data = json.loads(r.json()['searchResults'])['results'][0]
        results[store].append((data['name'], data['inventory']))
    except IndexError:
        continue

for store, data in results.items():  # use iteritems() on Python 2
    print('Store: {}'.format(store))
    if data:
        for name, inventory in data:
            print('\t{} - {}'.format(name, inventory))