How to read the next page on API using python iterator? - python

There is an API that only produces one hundred results per page. I am trying to make a while loop so that it goes through all pages and takes results from all pages, but it does not work. I would be grateful if you could help me figure it out.
params = dict(
    order_by='salary_desc',
    text=keyword,
    area=area,
    period=30,  # days
    per_page=100,
    page=0,
    no_magic='false',  # disable magic
    search_field='name'  # available: name, description, company_name
)
response = requests.get(
    BASE_URL + '/vacancies',
    headers={'User-Agent': generate_user_agent()},
    params=params,
)
response
items = response.json()['items']
vacancies = []
for item in items:
    vacancies.append(dict(
        id=item['id'],
        name=item['name'],
        salary_from=item['salary']['from'] if item['salary'] else None,
        salary_to=item['salary']['to'] if item['salary'] else None,
        currency=item['salary']['currency'] if item['salary'] else None,
        created=item['published_at'],
        company=item['employer']['name'],
        area=item['area']['name'],
        url=item['alternate_url']
    ))
I then loop: while there are results, I add 1 to the page parameter, using it as an iterator:
while vacancies == True:
    params['page'] += 1
As a result, params['page'] stays at zero (pages in the API start at zero).
Printing params after running the loop gives:
{'area': 1,
'no_magic': 'false',
'order_by': 'salary_desc',
'page': 0,
'per_page': 100,
'period': 30,
'search_field': 'name',
'text': '"python"'}
Perhaps I am doing the loop incorrectly, starting from the logic that while there is a result in the dictionary, the loop must be executed.

while vacancies == True:
    params['page'] += 1
This condition will never be true, regardless of the list's contents. A non-empty Python list is truthy, but it is not equal to the literal True. You need to lessen the strictness of the test.
if vacancies:  # truthy if len(vacancies) > 0, falsy otherwise
    # Do something
Or you can explicitly check that it has content:
if len(vacancies) > 0:
    # Do something
This solves the problem of how to test the object, but it doesn't solve the overall logic problem.
for _ in vacancies:
    params["page"] += 1
    # Does something for every item in vacancies
What you do in each iteration will depend on the problem and will require another question!
Fixed version below:
params = dict(
    order_by='salary_desc',
    text=keyword,
    area=area,
    period=30,  # days
    per_page=100,
    page=0,
    no_magic='false',  # disable magic
    search_field='name'  # available: name, description, company_name
)
pages = []
while True:
    response = requests.get(
        BASE_URL + '/vacancies',
        headers={'User-Agent': generate_user_agent()},
        params=params,
    )
    items = response.json()['items']
    if not items:
        break
    pages.append(items)  # collect the items of each page
    params["page"] += 1  # advance only after the current page (starting at 0) was fetched
Then build the vacancies for each page:
results = []
for page in pages:
    vacancies = []
    for item in page:
        vacancies.append(dict(
            id=item['id'],
            name=item['name'],
            salary_from=item['salary']['from'] if item['salary'] else None,
            salary_to=item['salary']['to'] if item['salary'] else None,
            currency=item['salary']['currency'] if item['salary'] else None,
            created=item['published_at'],
            company=item['employer']['name'],
            area=item['area']['name'],
            url=item['alternate_url']
        ))
    results.append(vacancies)
results will be the final list, holding one list of vacancies per page.
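The same stop-on-empty-page loop can also be packaged as a generator, which matches the "iterator" wording of the question. This is only a minimal sketch: iter_items, fake_fetch and FAKE_PAGES are hypothetical names, and the HTTP call is replaced by a stand-in so the logic can be run without the real API.

```python
from typing import Callable, Iterator


def iter_items(fetch_page: Callable[[int], list], start_page: int = 0) -> Iterator[dict]:
    """Yield items page by page until a page comes back empty."""
    page = start_page
    while True:
        items = fetch_page(page)
        if not items:  # an empty list is falsy: no more pages
            return
        yield from items
        page += 1


# Stand-in for the real HTTP call (hypothetical data: two pages of
# results followed by an empty page).
FAKE_PAGES = [[{"id": 1}, {"id": 2}], [{"id": 3}], []]


def fake_fetch(page: int) -> list:
    return FAKE_PAGES[page] if page < len(FAKE_PAGES) else []


all_items = list(iter_items(fake_fetch))
```

With the real API, fetch_page would be a small function that sets params['page'], calls requests.get as in the answer above, and returns response.json()['items'].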

vacancies is never equal to True.
If you want to test the boolean value of vacancies, you could use bool(vacancies).
But in Python you can simply write
while vacancies:
    # some code logic
This way, Python converts your list to bool automatically.
If your list has something inside (len(your_list) > 0), bool(your_list) evaluates to True; otherwise it is False.
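The difference between "equal to True" and "truthy" can be checked directly. A small illustration with made-up lists (the variable names are only for this example):

```python
empty, filled = [], [{"id": 1}]

# Comparing a list with the literal True is always False, empty or not.
equals_true = (filled == True, empty == True)

# Truthiness depends only on whether the list has elements.
truthy = (bool(filled), bool(empty))
```

This is why `while vacancies == True:` never runs its body, while `while vacancies:` runs as long as the list is non-empty.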
Also, instead of using dict(), you could write your dict this way:
params = {
    'order_by': 'salary_desc',
    'text': keyword,
    'area': area,
    'period': 30,  # days
    'per_page': 100,
    'page': 0,
    'no_magic': 'false',  # disable magic
    'search_field': 'name'  # available: name, description, company_name
}
which is more pythonic.

Related

Python script producing Incorrect Results or AttributeError with NAN Values

I'm trying to scrape a site (discogs.com) for a few different fields (num_have, num_want, num_versions, num_for_sale, value) per release_id. Generally it works ok, but I want to set some conditions to exclude release ids where:
num_have is greater than 18,
num_versions is 2 or less,
num_for_sale is 5 or less,
So I want results to be any release id that meets all three conditions. I can do that for conditions 1 & 2, but the 3rd is giving me trouble. I don't know how to adjust for where num_for_sale is 0. According to the api documentation (https://www.discogs.com/developers/#page:marketplace,header:marketplace-release-statistics), the body should look like this:
{
    "lowest_price": {
        "currency": "USD",
        "value": 2.09
    },
    "num_for_sale": 26,
    "blocked_from_sale": false
}
and "Releases that have no items for sale in the marketplace will return a body with null data in the lowest_price and num_for_sale keys. Releases that are blocked for sale will also have null data for these keys." So I think my errors come from releases where num_for_sale is 0 and the script doesn't know what to do with value. When I wrap the code that accesses market_data in a try-except block and set value and currency to None if an exception occurs, I get AttributeError: 'NoneType' object has no attribute 'get'.
What am I doing wrong? How should I rewrite this code:
import pandas as pd
import requests
import time
import tqdm

unique_northAmerica = pd.read_pickle("/Users/EJ/northAmerica_df.pkl")
unique_northAmerica = unique_northAmerica.iloc[1:69]
headers = {'Authorization': 'Discogs key=MY-KEY'}
results = []
for index, row in tqdm.tqdm(unique_northAmerica.iterrows(), total=len(unique_northAmerica)):
    release_id = row['release_id']
    response = requests.get(f'https://api.discogs.com/releases/{release_id}', headers=headers)
    data = response.json()
    if 'community' in data:
        num_have = data['community']['have']
        num_want = data['community']['want']
    else:
        num_have = None
        num_want = None
    if "master_id" in data:
        master_id = data['master_id']
        response = requests.get(f"https://api.discogs.com/masters/{master_id}/versions", headers=headers)
        versions_data = response.json()
        if "versions" in versions_data:
            num_versions = len(versions_data["versions"])
        else:
            num_versions = 1
    else:
        num_versions = 1
    response = requests.get(f'https://api.discogs.com/marketplace/stats/{release_id}', headers=headers)
    market_data = response.json()
    num_for_sale = market_data.get('num_for_sale', None)
    # Add the condition to only append to `results` if num_have <= 18 and num_versions <= 2
    if num_have and num_versions and num_have <= 18 and num_versions <= 2:
        if num_for_sale and num_for_sale <= 5:
            if 'lowest_price' in market_data:
                value = market_data['lowest_price'].get('value', None)
            else:
                value = None
        else:
            value = None
        if num_for_sale == 0:
            value = None
        results.append({
            'release_id': release_id,
            'num_have': num_have,
            'num_want': num_want,
            'num_versions': num_versions,
            'num_for_sale': num_for_sale,
            'value': value
        })
    time.sleep(4)
df = pd.DataFrame(results)
df.to_pickle("/Users/EJ/example.pkl")
Thanks in advance!
I've tried wrapping the code that accesses market_data in a try-except block and setting value and currency to None if an exception occurs, but I still get AttributeError: 'NoneType' object has no attribute 'get'.
Edit:
Traceback (most recent call last)
Cell In [139], line 41
39 if num_for_sale <= 5:
40 if 'lowest_price' in market_data:
---> 41 value = market_data['lowest_price'].get('value', None)
42 else:
43 value = None
AttributeError: 'NoneType' object has no attribute 'get'
You just need to add a check to see if the data is None.
if 'lowest_price' in market_data and market_data['lowest_price'] is not None:
    value = market_data['lowest_price'].get('value', None)
else:
    value = None
In fact, you can probably skip checking whether lowest_price exists, because the API documentation says it will be there; it just might hold null data.
So you could change it to:
if market_data['lowest_price']:
    value = ...
else:
    value = None
Per the Discogs API docs:
Releases that have no items for sale in the marketplace will return a body with null data in the lowest_price and num_for_sale keys. Releases that are blocked for sale will also have null data for these keys.
Which means that in one of those situations the converted JSON would look like this:
{
    "lowest_price": None,
    "num_for_sale": None,
    "blocked_from_sale": False
}
So when your code tries to call get on market_data['lowest_price'], what you're actually doing is calling None.get, which raises the error.
The reason it still includes releases where num_for_sale > 5 is that you append the results regardless of whether your check returns true or false. To fix this, all you need to do is adjust the indentation of your results.append statement.
if num_have and num_versions and num_have <= 18 and num_versions <= 2:
    if num_for_sale and num_for_sale <= 5:
        if market_data['lowest_price']:
            value = market_data['lowest_price'].get('value', None)
        else:
            value = None
        results.append({
            'release_id': release_id,
            'num_have': num_have,
            'num_want': num_want,
            'num_versions': num_versions,
            'num_for_sale': num_for_sale,
            'value': value
        })
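The null check can also be folded into a small helper. This is a sketch only: lowest_price_value is a hypothetical name, and the two sample bodies follow the shapes quoted from the Discogs documentation in the question.

```python
def lowest_price_value(market_data: dict):
    """Return lowest_price['value'], or None when the key is missing or null."""
    # `or {}` swaps a null (None) lowest_price for an empty dict,
    # so the following .get() never runs on None.
    lowest = market_data.get("lowest_price") or {}
    return lowest.get("value")


with_price = {
    "lowest_price": {"currency": "USD", "value": 2.09},
    "num_for_sale": 26,
    "blocked_from_sale": False,
}
no_listing = {"lowest_price": None, "num_for_sale": None, "blocked_from_sale": False}
```

The helper returns a price for with_price and None for no_listing, without raising AttributeError in either case.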

key error when trying to split two entries in a dictionary in python

I have a dictionary with entries that display the IP and port like this:
{'source': '192.168.4.1:80', 'destination': '168.20.10.1:443'}
but I want it to display them like this:
{'src_ip': '192.168.4.1', 'src_port': 80, 'dest_ip': '168.20.10.1', 'dest_port': 443}
So I want to split the first two entries into four new ones and delete the two old ones.
My code currently looks like this:
log_entry = {'source': '192.168.4.1:80', 'destination': '168.20.10.1:443'}
def split_ip_port(log_entry):
    u_source = log_entry['source']
    if ':' in u_source:
        src_list = u_source.split(':')
        src_ip = src_list[0]
        src_port = src_list[1]
        log_entry.update({'src_ip': src_ip})
        log_entry.update({'src_port': src_port})
        del log_entry['source']
    u_dest = log_entry['destination']
    if ':' in u_dest:
        dest_list = u_dest.split(':')
        dest_ip = dest_list[0]
        dest_port = dest_list[1]
        print(dest_list)
        log_entry.update({'dest_ip': dest_ip})
        log_entry.update({'dest_port': dest_port})
        del log_entry['destination']
    return log_entry
When I try to test the source, it gives me KeyError: 'destination', and when I try to test the destination, it gives me KeyError: 'source'. What is happening here?
When you split the value (e.g., log_entry['source'].split(":")), it returns the list ['192.168.4.1', '80']. You then take the element you need by index ([0] is '192.168.4.1') and assign it to a new key in your dict, such as log_entry['src_ip']:
log_entry['src_ip'] = log_entry['source'].split(":")[0]
log_entry['src_port'] = log_entry['source'].split(":")[1]
log_entry['dest_ip'] = log_entry['destination'].split(":")[0]
log_entry['dest_port'] = log_entry['destination'].split(":")[1]
del log_entry['source']
del log_entry['destination']
Since the original code works, this is just a suggestion to simplify it: split each source/destination value into address and port, then build a new dictionary like this:
orig_dc = {'source': '192.168.4.1:80', 'destination': '168.20.10.1:443'}
new_dc = {}
for k, v in orig_dc.items():
    orig, port = v.split(':')
    if k == 'source':
        new_dc.setdefault('src_ip', orig)
        new_dc.setdefault('src_port', int(port))
    else:
        new_dc.setdefault('dest_ip', orig)
        new_dc.setdefault('dest_port', int(port))
expected = {'src_ip': '192.168.4.1', 'src_port': 80,
            'dest_ip': '168.20.10.1', 'dest_port': 443}
assert new_dc == expected
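As for the KeyError itself, a likely cause (an assumption, since the test inputs are not shown) is that the function reads both log_entry['source'] and log_entry['destination'] unconditionally, so an entry holding only one of the two keys fails on the other. A minimal sketch that tolerates a missing key and also converts the port to int, as in the desired output:

```python
def split_ip_port(log_entry: dict) -> dict:
    """Return a copy with 'source'/'destination' split into ip and port keys."""
    result = dict(log_entry)  # leave the caller's dict untouched
    for old_key, prefix in (("source", "src"), ("destination", "dest")):
        value = result.pop(old_key, None)  # None when the key is absent
        if value is None:
            continue
        ip, sep, port = value.rpartition(":")
        if not sep:
            # No colon found: treat the whole value as the address.
            result[f"{prefix}_ip"] = value
            continue
        result[f"{prefix}_ip"] = ip
        result[f"{prefix}_port"] = int(port)
    return result


full = split_ip_port({"source": "192.168.4.1:80", "destination": "168.20.10.1:443"})
only_source = split_ip_port({"source": "192.168.4.1:80"})
```

Using pop with a default is what avoids the KeyError when only one of the two keys is present.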

Python Json value overwritten by last value

lista =
[{Identity: joe,
summary:[
{distance: 1, time:2, status: idle},
{distance:2, time:5, status: moving}],
{unit: imperial}]
I can pull the data easily and put in pandas. The issue is, if an identity has multiple instances of, say idle, it takes the last value, instead of summing together.
my code...
driverid = {}
zdriverhours = {}
zdistance = {}
zstophours = {}
for driver in resp:
    driverid[driver['AssetID']] = driver['AssetName']
    for value in [driver['SegmentSummary']]:
        for value in value:
            if value['SegmentType'] == 'Motion':
                zdriverhours[driver['AssetID']] = round(value['Time']/3600,2)
            if value['SegmentType'] == 'Stop':
                zstophours[driver['AssetID']] = round(value['IdleTime']/3600,2)
            zdistance[driver['AssetID']] = value['Distance']
To obtain the total distance for every driver, replace:
zdistance[driver['AssetID']] = value['Distance']
with
if driver['AssetID'] in zdistance:
    zdistance[driver['AssetID']] = zdistance[driver['AssetID']] + value['Distance']
else:
    zdistance[driver['AssetID']] = value['Distance']
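The same accumulation applies to the hour totals, and collections.defaultdict removes the need for the membership check. The sketch below uses a hypothetical resp in the shape the question's loop expects; the values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical response in the shape the question's loop expects.
resp = [
    {"AssetID": 7, "AssetName": "joe", "SegmentSummary": [
        {"SegmentType": "Stop", "IdleTime": 1800, "Distance": 0},
        {"SegmentType": "Motion", "Time": 3600, "Distance": 12.5},
        {"SegmentType": "Stop", "IdleTime": 5400, "Distance": 0},
        {"SegmentType": "Motion", "Time": 1800, "Distance": 7.5},
    ]},
]

# Missing keys start at 0.0, so += works on the first segment too.
zdriverhours = defaultdict(float)
zstophours = defaultdict(float)
zdistance = defaultdict(float)

for driver in resp:
    asset_id = driver["AssetID"]
    for segment in driver["SegmentSummary"]:
        if segment["SegmentType"] == "Motion":
            zdriverhours[asset_id] += segment["Time"] / 3600
        elif segment["SegmentType"] == "Stop":
            zstophours[asset_id] += segment["IdleTime"] / 3600
        zdistance[asset_id] += segment["Distance"]
```

Rounding is best applied once at the end, when the totals are displayed, rather than on every segment.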

Python Order Dictionary in Chronological Order even if keys are different

I am trying to make an RSS feed composed of different sources and I would like them to be sorted by newest date, rather than the source itself. I store all of my news in one python dictionary, regardless of its source:
feed = None
if sports['nhl'] == 1:
    feed = newsParse('nhl')
    allOff = False
if sports['nba'] == 1:
    feed = newsParse('nba')
    allOff = False
if sports['nfl'] == 1:
    feed = newsParse('nfl')
    allOff = False
if sports['mlb'] == 1:
    feed = newsParse('mlb')
    allOff = False
The function looks like this:
def newsParse(league):
    rss_url = 'https://www.espn.com/espn/rss/' + league + '/news'
    parser = feedparser.parse(rss_url)
    newsInfo = {
        'title': [],
        'link': [],
        'description': [],
        'date': []
    }
    for entry in parser.entries:
        newsInfo['title'].append(entry.title)
        newsInfo['description'].append(entry.description)
        newsInfo['link'].append(entry.links[0].href)
        newsInfo['date'].append(entry.published)
    return newsInfo
If I print out 'feed' I get all of the titles sorted by source, then all of the descriptions sorted by source, and etc. The ['date'] data looks like this:
'Fri, 24 Jul 2020 09:35:08 EST'
How can I sort all of my values in chronological order, whilst keeping the titles, descriptions, and links together?
Why not save the entries as a list of dictionaries?
For example:
def newsParse(league):
    rss_url = 'https://www.espn.com/espn/rss/' + league + '/news'
    parser = feedparser.parse(rss_url)
    newsInfo = []
    for entry in parser.entries:
        newEntry = {'title': entry.title,
                    'description': entry.description,
                    'link': entry.link,
                    'date': entry.published}
        newsInfo.append(newEntry)
    return newsInfo
newsInfo will be a list of dictionaries, and you can sort that list using this line of code:
sorted(newsInfo, key=lambda k: k['date'])
If the date from the RSS feed is a string, you should convert it to Python's datetime type first; otherwise the entries are ordered alphabetically by the date text rather than chronologically.
Edit (answer for comment):
If you need a single list with all the leagues, you can use this code:
feed = []
if sports['nhl'] == 1:
    feed.extend(newsParse('nhl'))
    allOff = False
if sports['nba'] == 1:
    feed.extend(newsParse('nba'))
    allOff = False
if sports['nfl'] == 1:
    feed.extend(newsParse('nfl'))
    allOff = False
if sports['mlb'] == 1:
    feed.extend(newsParse('mlb'))
    allOff = False
After feed contains all the data you need, you can sort it by date:
sorted(feed, key=lambda k: k['date'])
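Sorting on the raw date string compares text, so "Fri" would sort before "Sat" and "Thu" regardless of the actual day. A minimal sketch of a chronological sort, assuming the dates follow the RFC 822 style shown in the question; the feed entries here are hypothetical and the standard library's email.utils.parsedate_to_datetime is used for parsing:

```python
from email.utils import parsedate_to_datetime

# Hypothetical entries in the list-of-dicts shape used above.
feed = [
    {"title": "b", "date": "Fri, 24 Jul 2020 09:35:08 EST"},
    {"title": "a", "date": "Thu, 23 Jul 2020 22:10:00 EST"},
    {"title": "c", "date": "Sat, 01 Aug 2020 08:00:00 EST"},
]

# Parse each date string into a timezone-aware datetime for comparison.
newest_first = sorted(
    feed,
    key=lambda entry: parsedate_to_datetime(entry["date"]),
    reverse=True,
)
titles = [entry["title"] for entry in newest_first]
```

feedparser also exposes a parsed form of the date on each entry (published_parsed, a time.struct_time), which could serve as the sort key instead of re-parsing the string.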

Parsing Security Matrix Spreadsheet - NoneType is not Iterable

I am trying to nest the no's and yes's with their respective applications and services.
That way, when a request comes in for a specific zone-to-zone sequence, a check can be run against this logic to verify accepted requests.
I have tried calling Decision_List[Zone_Name][yes_no].update, and I tried .append when it was a list type and not a dict, but there is no update method?
Base_Sheet = range(5, sh.ncols)
Column_Rows = range(1, sh.nrows)
for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]:{}}
    Zone_Svc_Header = {sh.col_values(3)[0]:{}}
    Zone_Proto_Header = {sh.col_values(2)[0]:{}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name:{}}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(3)[rows]
        dst_object = sh.col_values(2)[rows]
        src_object = sh.col_values(1)[rows]
        yes_no = sh.col_values(colnum)[rows]
        if yes_no not in Decision_List[Zone_Name]:
            Decision_List[Zone_Name][yes_no] = [app_object]
        else:
            Decision_List[Zone_Name]=[yes_no].append(app_object)
I would like it to present the info as follows:
Decision_List{Zone_Name:{yes:[ssh, ssl, soap], no:[web-browsing,facebook]}}
I would still like to know why I couldn't call the append method on that specific yes_no key whose value was a list.
But in the meantime, I made a workaround of sorts. I used a tuple as the key and yes_no as the value. This allows me to pair many "no" values with keys that are tuples of the application, port, service, etc., and then I can search for "yes" values and create additional dicts out of them for logic.
If there are any better ideas out there, I am all ears.
for rownum in range(0, sh.nrows):
    #row_val is all the values in the row of cell.index[rownum] as determined by rownum
    row_val = sh.row_values(rownum)
    col_val = sh.col_values(rownum)
    print rownum, col_val[0], col_val[1: CoR]
    header.append({col_val[0]: col_val[1: CoR]})
print header[0]['Start Port']
dec_tree = {}
count = 1
Base_Sheet = range(5, sh.ncols)
Column_Rows = range(1, sh.nrows)
for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]:{}}
    Zone_Svc_Header = {sh.col_values(3)[0]:{}}
    Zone_Proto_Header = {sh.col_values(2)[0]:{}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name:{}}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(3)[rows]
        dst_object = sh.col_values(2)[rows]
        src_object = sh.col_values(1)[rows]
        yes_no = sh.col_values(colnum)[rows]
        for rule_name in Decision_List.iterkeys():
            Decision_List[Zone_Name][(app_object, svc_object, proto_object)]= yes_no
Thanks again.
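On the open question of why append seemed unavailable: in the original else branch, the expression [yes_no].append(app_object) builds a new one-element list, appends to it, and evaluates to None, because list.append always returns None. That None is then assigned to Decision_List[Zone_Name], which would explain the "NoneType is not iterable" error on the next membership test. A small illustration with made-up values:

```python
decisions = {"zone-a": {"yes": ["ssh"]}}

# What the else branch does: build a throwaway list ["yes"], append to it,
# and keep append's return value, which is always None.
overwritten = ["yes"].append("ssl")

# Indexing the nested dict instead appends to the list that is stored there.
decisions["zone-a"]["yes"].append("ssl")
```

The difference is a single character: Decision_List[Zone_Name][yes_no].append(...) indexes the dict, while Decision_List[Zone_Name]=[yes_no].append(...) reassigns it.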
I think a better way still is to use collections.defaultdict.
This ensures that I can append to the specific yes_no key as I had originally intended.
from collections import defaultdict

for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]:{}}
    Zone_Svc_Header = {sh.col_values(3)[0]:{}}
    Zone_Proto_Header = {sh.col_values(2)[0]:{}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name: defaultdict(list)}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(2)[rows]
        dst_object = sh.col_values(1)[rows]
        src_object = sh.col_values(0)[rows]
        yes_no = sh.col_values(colnum)[rows]
        # defaultdict(list) creates the empty list on first access,
        # so no "if yes_no not in ..." check is needed
        Decision_List[Zone_Name][yes_no].append(
            (app_object, svc_object, proto_object, dst_object, src_object))
This allows me to store each row's values as a tuple and append them as needed.
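The grouping behaviour of defaultdict(list) can be tried without a spreadsheet. This is a sketch with hypothetical rows standing in for the sheet values; the zone name and tuples are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (application, service, protocol, decision for one zone).
rows = [
    ("ssh", "tcp-22", "tcp", "yes"),
    ("web-browsing", "tcp-80", "tcp", "no"),
    ("ssl", "tcp-443", "tcp", "yes"),
]

decision_list = {"zone-a": defaultdict(list)}
for app, svc, proto, yes_no in rows:
    # A missing key starts as an empty list, so no membership check is needed.
    decision_list["zone-a"][yes_no].append((app, svc, proto))
```

Every value stored under "yes" or "no" is then a list of tuples of the same shape, which keeps later lookups uniform.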
