Scraping profiles with Python and the "scrape-linkedin" package

Scraping profiles with Python and the "scrape-linkedin" package - python

I am trying to use the scrape_linkedin package. I follow the section on the github page on how to set up the package/LinkedIn li_at key (which I paste here for clarity).
Getting LI_AT
Navigate to www.linkedin.com and log in
Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
Find and copy the li_at value
Once I collect the li_at value from my LinkedIn, I run the following code:
from scrape_linkedin import ProfileScraper
with ProfileScraper(cookie='myVeryLong_li_at_Code_which_has_characters_like_AQEDAQNZwYQAC5_etc') as scraper:
profile = scraper.scrape(url='https://www.linkedin.com/in/justintrudeau/')
print(profile.to_dict())
I have two questions (I am originally an R user).
How can I input a list of profiles:
https://www.linkedin.com/in/justintrudeau/
https://www.linkedin.com/in/barackobama/
https://www.linkedin.com/in/williamhgates/
https://www.linkedin.com/in/wozniaksteve/
and scrape the profiles? (In R I would use the map function from the purrr package to apply the function to each of the LinkedIn profiles).
The output (from the original github page) is returned in a JSON style format. My second question is how I can convert this into a pandas data frame (i.e. it is returned similar to the following).
{'personal_info': {'name': 'Steve Wozniak', 'headline': 'Fellow at
Apple', 'company': None, 'school': None, 'location': 'San Francisco
Bay Area', 'summary': '', 'image': '', 'followers': '', 'email': None,
'phone': None, 'connected': None, 'websites': [],
'current_company_link': 'https://www.linkedin.com/company/sandisk/'},
'experiences': {'jobs': [{'title': 'Chief Scientist', 'company':
'Fusion-io', 'date_range': 'Jul 2014 – Present', 'location': 'Primary
Data', 'description': "I'm looking into future technologies applicable
to servers and storage, and helping this company, which I love, get
noticed and get a lead so that the world can discover the new amazing
technology they have developed. My role is principally a marketing one
at present but that will change over time.", 'li_company_url':
'https://www.linkedin.com/company/sandisk/'}, {'title': 'Fellow',
'company': 'Apple', 'date_range': 'Mar 1976 – Present', 'location': '1
Infinite Loop, Cupertino, CA 94015', 'description': 'Digital Design
engineer.', 'li_company_url': ''}, {'title': 'President & CTO',
'company': 'Wheels of Zeus', 'date_range': '2002 – 2005', 'location':
None, 'description': None, 'li_company_url':
'https://www.linkedin.com/company/wheels-of-zeus/'}, {'title':
'diagnostic programmer', 'company': 'TENET Inc.', 'date_range': '1970
– 1971', 'location': None, 'description': None, 'li_company_url':
''}], 'education': [{'name': 'University of California, Berkeley',
'degree': 'BS', 'grades': None, 'field_of_study': 'EE & CS',
'date_range': '1971 – 1986', 'activities': None}, {'name': 'University
of Colorado Boulder', 'degree': 'Honorary PhD.', 'grades': None,
'field_of_study': 'Electrical and Electronics Engineering',
'date_range': '1968 – 1969', 'activities': None}], 'volunteering':
[]}, 'skills': [], 'accomplishments': {'publications': [],
'certifications': [], 'patents': [], 'courses': [], 'projects': [],
'honors': [], 'test_scores': [], 'languages': [], 'organizations':
[]}, 'interests': ['Western Digital', 'University of Colorado
Boulder', 'Western Digital Data Center Solutions', 'NEW Homebrew
Computer Club', 'Wheels of Zeus', 'SanDisk®']}

Firstly, You can create a custom function to scrape data and use map function in Python to apply it over each profile link.
Secondly, to create a pandas dataframe using a dictionary, you can simply pass the dictionary to pd.DataFrame.
Thus to create a dataframe df, with dictionary dict, you can do like this:
df = pd.DataFrame(dict)

Related

Accessing a webpage within a webpage using BeautifulSoup?

I have written a Python script that parses the data of a webpage using beautifulsoup. What i want to do further is to click the NAME of each person on page, access their profile, then click on the website link on that page and scrape the email id ( if available ) from that website. Can anyone help me out with this? I am new to beautifulsoup and python so i am unable to proceed further. Any help is appreciated.
Thanks!
The kind of link i am working on is:
https://www.realtor.com/realestateagents/agentname-john
Here is my code:
from bs4 import BeautifulSoup
import requests
import csv
##################### Website
##################### URL
w_url = str('https://www.')+str(input('Please Enter Website URL :'))
####################### Number of
####################### Pages
pages = int(input(' Please specify number of pages: '))
####################### Range
####################### Specified
page_range = list(range(0,pages))
####################### WebSite
####################### Name ( in case of multiple websites )
#site_name = int(input('Enter the website name ( IN CAPITALS ) :'))
####################### Empty
####################### List
agent_info= []
####################### Creating
####################### CSV File
csv_file = open(r'D:\Webscraping\real_estate_agents.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Name and Number'])
####################### FOR
####################### LOOP
for k in page_range:
website = requests.get(w_url+'/pg-'+'{}'.format(k)).text
soup = BeautifulSoup(website,'lxml')
class1 = 'jsx-1448471805 agent-name text-bold'
class2 = 'jsx-1448471805 agent-phone hidden-xs hidden-xxs'
for i in soup.find_all('div',class_=[[class1],[class2]]):
w = i.text
agent_info.append(w)
##################### Reomiving
##################### Duplicates
updated_info= list(dict.fromkeys(agent_info))
##################### Writing Data
##################### to CSV
for t in updated_info:
print(t)
csv_writer.writerow([t])
print('\n')
csv_file.close()

Would be more efficient (and less lines of code) if you grab the data from the api. It also appears the website emails are within that too, so if needed, no need to go to each of the 30,000+ websites for that email, so you can get it all in a fraction of the time.
The api also has all the data you'd want/need. For example, here's everythin on just 1 agent:
{'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'advertiser_id': 2121274, 'agent_rating': 5, 'background_photo': {'href': 'https://ap.rdcpix.com/1223152681/cc48579b6a0fe6ccbbf44d83e8f82145g-c0o.jpg'}, 'broker': {'fulfillment_id': 3860509, 'designations': [], 'name': 'BRIDGE REALTY, LLC.', 'accent_color': '', 'photo': {'href': ''}, 'video': ''}, 'description': 'As a professional real estate agent licensed in the State of Minnesota, I am committed to providing only the highest standard of care as I assist you in navigating the twists and turns of home ownership. Whether you are buying or selling your home, I will do everything it takes to turn your real estate goals and desires into a reality. If you are looking for a real estate Agent who will put your needs first and go above and beyond to help you reach your goals, I am the agent for you.', 'designations': [], 'first_month': 0, 'first_name': 'John', 'first_year': 2010, 'has_photo': True, 'href': 'http://www.twincityhomes4sale.com', 'id': '56b63efd7e54f7010021459d', 'is_realtor': True, 'languages': [], 'last_name': 'Palomino', 'last_updated': 'Mon, 04 Jan 2021 18:46:12 GMT', 'marketing_area_cities': [{'city_state': 'Columbus_MN', 'name': 'Columbus', 'state_code': 'MN'}, {'city_state': 'Blaine_MN', 'name': 'Blaine', 'state_code': 'MN'}, {'city_state': 'Circle Pines_MN', 'name': 'Circle Pines', 'state_code': 'MN'}, {'city_state': 'Lino Lakes_MN', 'name': 'Lino Lakes', 'state_code': 'MN'}, {'city_state': 'Lexington_MN', 'name': 'Lexington', 'state_code': 'MN'}, {'city_state': 'Forest Lake_MN', 'name': 'Forest Lake', 'state_code': 'MN'}, {'city_state': 'Chisago City_MN', 'name': 'Chisago City', 'state_code': 'MN'}, {'city_state': 'Wyoming_MN', 'name': 'Wyoming', 'state_code': 'MN'}, {'city_state': 'Centerville_MN', 'name': 'Centerville', 'state_code': 'MN'}, {'city_state': 'Hugo_MN', 'name': 'Hugo', 'state_code': 'MN'}, {'city_state': 'Grant_MN', 'name': 'Grant', 'state_code': 'MN'}, {'city_state': 'St. Anthony_MN', 'name': 'St. Anthony', 'state_code': 'MN'}, {'city_state': 'Arden Hills_MN', 'name': 'Arden Hills', 'state_code': 'MN'}, {'city_state': 'New Brighton_MN', 'name': 'New Brighton', 'state_code': 'MN'}, {'city_state': 'Mounds View_MN', 'name': 'Mounds View', 'state_code': 'MN'}, {'city_state': 'White Bear Township_MN', 'name': 'White Bear Township', 'state_code': 'MN'}, {'city_state': 'Vadnais Heights_MN', 'name': 'Vadnais Heights', 'state_code': 'MN'}, {'city_state': 'Shoreview_MN', 'name': 'Shoreview', 'state_code': 'MN'}, {'city_state': 'Little Canada_MN', 'name': 'Little Canada', 'state_code': 'MN'}, {'city_state': 'Columbia Heights_MN', 'name': 'Columbia Heights', 'state_code': 'MN'}, {'city_state': 'Hilltop_MN', 'name': 'Hilltop', 'state_code': 'MN'}, {'city_state': 'Fridley_MN', 'name': 'Fridley', 'state_code': 'MN'}, {'city_state': 'Linwood_MN', 'name': 'Linwood', 'state_code': 'MN'}, {'city_state': 'East Bethel_MN', 'name': 'East Bethel', 'state_code': 'MN'}, {'city_state': 'Spring Lake Park_MN', 'name': 'Spring Lake Park', 'state_code': 'MN'}, {'city_state': 'North St. Paul_MN', 'name': 'North St. Paul', 'state_code': 'MN'}, {'city_state': 'Maplewood_MN', 'name': 'Maplewood', 'state_code': 'MN'}, {'city_state': 'St. Paul_MN', 'name': 'St. Paul', 'state_code': 'MN'}], 'mls': [{'member': {'id': '506004321'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'A', 'primary': True}], 'nar_only': 1, 'nick_name': '', 'nrds_id': '506004321', 'office': {'name': 'Bridge Realty, Llc', 'mls': [{'member': {'id': '10982'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'O', 'primary': True}], 'phones': [{'ext': '', 'number': '(952) 368-0021', 'type': 'Home'}], 'phone_list': {'phone_1': {'type': 'Home', 'number': '(952) 368-0021', 'ext': ''}}, 'photo': {'href': ''}, 'slogan': '', 'website': None, 'video': None, 'fulfillment_id': 3027311, 'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'email': 'tony#thebridgerealty.com', 'nrds_id': None}, 'party_id': 23115328, 'person_name': 'John Palomino', 'phones': [{'ext': '', 'number': '(763) 458-0788', 'type': 'Mobile'}], 'photo': {'href': 'https://ap.rdcpix.com/900899898/cc48579b6a0fe6ccbbf44d83e8f82145a-c0o.jpg'}, 'recommendations_count': 2, 'review_count': 7, 'role': 'agent', 'served_areas': [{'name': 'Circle Pines', 'state_code': 'MN'}, {'name': 'Forest Lake', 'state_code': 'MN'}, {'name': 'Hugo', 'state_code': 'MN'}, {'name': 'St. Paul', 'state_code': 'MN'}, {'name': 'Minneapolis', 'state_code': 'MN'}, {'name': 'Wyoming', 'state_code': 'MN'}], 'settings': {'share_contacts': False, 'full_access': False, 'recommendations': {'realsatisfied': {'user': 'John-Palomino', 'id': '1073IJk', 'linked': '3d91C', 'updated': '1529551719'}}, 'display_listings': True, 'far_override': True, 'show_stream': True, 'terms_of_use': True, 'has_dotrealtor': False, 'display_sold_listings': True, 'display_price_range': True, 'display_ratings': True, 'loaded_from_sb': True, 'broker_data_feed_opt_out': False, 'unsubscribe': {'autorecs': False, 'recapprove': False, 'account_notify': False}, 'new_feature_popup_closed': {'agent_left_nav_avatar_to_profile': False}}, 'slogan': 'Bridging the gap between buyers & sellers', 'specializations': [{'name': '1st time home buyers'}, {'name': 'Residential Listings'}, {'name': 'Rental/Investment Properties'}, {'name': 'Move Up Buyers'}], 'title': 'Agent', 'types': 'agent', 'user_languages': [], 'web_url': 'https://www.realtor.com/realestateagents/John-Palomino_BLOOMINGTON_MN_2121274_876599394', 'zips': ['55014', '55025', '55038', '55112', '55126', '55421', '55449', '55092', '55434', '55109'], 'email': 'johnpalomino#live.com', 'full_name': 'John Palomino', 'name': 'John Palomino, Agent', 'social_media': {'facebook': {'type': 'facebook', 'href': 'https://www.facebook.com/Johnpalominorealestate'}}, 'for_sale_price': {'count': 1, 'min': 299900, 'max': 299900, 'last_listing_date': '2021-01-29T11:10:24Z'}, 'recently_sold': {'count': 35, 'min': 115000, 'max': 460000, 'last_sold_date': '2020-12-18'}, 'agent_team_details': {'is_team_member': False}}
Code:
import requests
import pandas as pd
import math
# Function to pull the data
def get_agent_info(jsonData, rows):
agents = jsonData['agents']
for agent in agents:
name = agent['person_name']
if 'email' in agent.keys():
email = agent['email']
else:
email = 'N/A'
if 'href' in agent.keys():
website = agent['href']
else:
website = 'N/A'
try:
office_data = agent['office']
office_email = office_data['email']
except:
office_email = 'N/A'
row = {'name':name, 'email':email, 'website':website, 'office_email':office_email}
rows.append(row)
return rows
rows = []
url = 'https://www.realtor.com/realestateagents/api/v3/search'
headers= {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}
payload = {'nar_only': '1','offset': '','limit': '300','marketing_area_cities': '_',
'postal_code': '','is_postal_search': 'true','name': 'john','types': 'agent',
'sort': 'recent_activity_high','far_opt_out': 'false','client_id': 'FAR2.0',
'recommendations_count_min': '','agent_rating_min': '','languages': '',
'agent_type': '','price_min': '','price_max': '','designations': '',
'photo': 'true'}
# Gets 1st page, finds how many pages yoyu'll need to go through, and parses the data
jsonData = requests.get(url, headers=headers, params=payload).json()
total_matchs = jsonData['matching_rows']
total_pages = math.ceil(total_matchs/300)
rows = get_agent_info(jsonData, rows)
print ('Completed: %s of %s' %(1,total_pages))
# Iterate through next pages
for page in range(1,total_pages):
payload.update({'offset':page*300})
jsonData = requests.get(url, headers=headers, params=payload).json()
rows = get_agent_info(jsonData, rows)
print ('Completed: %s of %s' %(page+1,total_pages))
df = pd.DataFrame(rows)
Output: Just the first 10 rows of 30,600
print(df.head(10).to_string())
name email website office_email
0 John Croteau jcrot45#gmail.com https://www.facebook.com/JCtherealtor/ 1worcesterhomes#gmail.com
1 Stephanie St John sstjohn#shorewest.com https://stephaniestjohn.shorewest.com customercare#shorewest.com
2 Johnine Larsen info#realestategals.com http://realestategals.com seattle#northwestrealtors.com
3 Leonard Johnson americandreams#comcast.net http://www.adrhomes.net americandreams#comcast.net
4 John C Fitzgerald john#jcfhomes.com http://www.JCFHomes.com
5 John Vrsansky Jr John#OnTargetRealty.com http://www.OnTargetRealty.com john#ontargetrealty.com
6 John Williams jwilliamsidaho#gmail.com http://www.johnwilliamsidaho.com mpickford#kw.com
7 John Zeiter j.zeiter#ggsir.com info#ggsir.com
8 Mitch Johnson mitchjohnson1316#gmail.com miaroberson#creedrealty.com
9 John Lowe jplowe4#gmail.com http://johnlowegroup.com thedavisgrouponline#gmail.com

I have used requests(docs) instead of beautifulsoup, but still I tried to keep it as simple as possible
I have implement for the mentioned website specifically.I am filtering based on other attributes instead of class names and extracting the agent name from URL.
I am populating the set agentWebsites with required information in format (agentName, collection (tuple) of agentWebsite mentioned in their profile).
I am populating the set agentEmails with required information in format (agentName, collection (tuple) of emails mentioned in their websites).
I am not using a dict with agentName as key and
websites/emails as values since the agentName may not be unique and it can't be
used as a key.
Extracting email from websites:
Not all websites have email mentioned in them, some are dummy websites redirecting to some others and some have a form to fill our details to contact them instead of mentioning theirs.
handling exceptions:
Some websites are not accessible and they will be printed in output.
Some websites are taking lot of time to render, they are also being printed in output. you can increase the value of timeout_length global variable. when I tried, some websites with this error were getting rende
red for 200.
any other exceptions like Connection Error, etc will be handled by caught by last except and message will be printed to output.
Code:
from requests_html import HTMLSession, MaxRetries
from requests.exceptions import ConnectionError
import re
import sys
# Global values to store the links of individual agents, and their websites
agentLinks = set()
agentWebsites = set()
agentEmails = set()
session = HTMLSession()
timeout_length = 10
# urls used
start_url = "https://www.realtor.com/realestateagents/agentname-john"
base_url = "https://www.realtor.com"
# Regex to match emails from website
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
# no of pages to be scraped in website
no_of_pages = int(input("Enter no of pages to be scraped:\t"))
# scraping the links of agent profiles, page by page
for page in range(1, no_of_pages + 1):
r = session.get(start_url + '/pg-' + str(page))
# get all anchor tags
agentInfo = r.html.find('a')
for info in agentInfo:
# filter only agent profiles and extract links
if "href" in info.attrs and info.attrs["href"].startswith("/realestateagents/"):
agentLinks.add(info.attrs["href"])
print('page', page, 'agents found till now', len(agentLinks))
print('Total agents found till now', len(agentLinks))
# scrape the agentProfile page for the website link
print('---Scraping Website from agent Profile and email from agents Websites---')
agent_count = 0
total_agents = len(agentLinks)
for agentLink in agentLinks:
emails = set()
websites = set()
agentName = agentLink.replace("/realestateagents/", "").split('_')[0].replace('-', ' ').title()
# print the profile scraping progress
agent_count += 1
sys.stdout.write("\rscraping agent{0}'s profile".format(agent_count))
sys.stdout.flush()
r = session.get(base_url + agentLink)
# get all anchor tags
agentInfo = r.html.find('a')
for info in agentInfo:
# filter only website link and extract link
if "href" in info.attrs and "data-linkname" in info.attrs and info.attrs[
"data-linkname"] == "realtors:agent_details:contact_details:website":
agentWebsite = info.attrs["href"]
websites.add(agentWebsite)
if websites:
agentWebsites.add((agentName, tuple(websites)))
# print the email scraping progress
sys.stdout.write("\rscraping agent{0}'s websites for emails".format(agent_count))
sys.stdout.flush()
# scrape EMAILS in the websites
for website in websites:
try:
r = session.get(website)
r.html.render(timeout=timeout_length)
for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
if '/' not in re_match.group():
emails.add(re_match.group())
except ConnectionError:
print('\rcannot connect to', website)
except MaxRetries as mr:
print("\r", mr.message.replace('page.', website), sep='')
except:
print("\rUnexpected error for site", website, ":", sys.exc_info()[0])
finally:
# print the email scraping progress
sys.stdout.write("\rscraping agent{0}'s websites for emails".format(agent_count))
sys.stdout.flush()
# after scraping all websites, add all emails found
if emails:
agentEmails.add((agentName, tuple(emails)))
# agentWebsites is a set of tuples of format (agentName, agentWebsite url)
print("\r\nTotal Agent websites scraped", len(agentWebsites))
print(agentWebsites)
print("\nNo of agents with emails scraped", len(agentEmails))
print(agentEmails)
example output:
Enter no of pages to be scraped: 2
page 1 agents found till now 20
page 2 agents found till now 40
Total agents found till now 40
Scraping Website from agent Profile and email from agents Websites
cannot connect to https://www.david-johnston.kw.com
Unable to render the http://www.reefpointrealestate.com/ Try increasing timeout
cannot connect to http://www.patricia-johnson.com
Unable to render the http://palisadeshomes.com/ Try increasing timeout
Unexpected error for site https://www.jwhomesteam.com : <class 'pyppeteer.errors.NetworkError'>
cannot connect to http://www.stevenjohnson.org
cannot connect to http://www.johnrod.com/
cannot connect to http://www.rodneyjohnson.net
cannot connect to http://john.estatesoflasvegas.com
cannot connect to http://www.teamgoodell.com
cannot connect to http://Hilyardproperties.com
Total Agent websites scraped 32
{('John Mcnamara', ('http://www.ttrsir.com',)),... ('Don Johnson Pc', ('https://www.jwhomesteam.com',))}
No of agents with emails scraped 11
{('John Genovese And Richard Lester', ('connect#mycitycountry.com',)), ... ('John "Dan" Bethel', ('therealtygroupohio#gmail.com', 'danbethelteacher#gmail.com'))}
Note:
we can use r.html.find('a', containing='<text>') for filtering, but it didn't seem to work for me.

How to Match two APIs to update one API dataset using Python

I want to be able to GET information from API 1 and match it with API 2 and be able to update API 2's information with API 1. I am trying to figure out the most efficient/automated way to accomplish this as it also needs to be updated at a interval of every 10 minutes
I can query and get the results from API 1 this is my code and what my code looks like.
import json
import requests
myToken = '52c32f6588004cb3ab33b0ff320b8e4f'
myUrl = 'https://api1.com/api/v1/devices.json'
head = {'Authorization': 'Token {}'.format(myToken)}
response = requests.get(myUrl, headers=head)
r = json.loads(response.content)
r
The payload looks like this from API 1
{ "device" : {
"id": 153,
"battery_status" : 61,
"serial_no": "5QBYGKUI05",
"location_lat": "-45.948917",
"location_lng": "29.832179",
"location_address": "800 Laurel Rd, Lansdale, PA 192522,USA"}
}
I want to be able to take this information and match by "serial_no" and update all the other pieces of information for the corresponding device in API 2
I query the data for API 2 and this is what my code looks like
params = {
"location":'cf6707e3-f0ae-4040-a184-737b21a4bbd1',
"dateAdded":'ge:11/23/2020'}
url = requests.get('https://api2.com/api/assets',auth=('api2', '123456'), params=params)
r = json.loads(url.content)
r['items']
The JSON payload looks like this
[{'id': '064ca857-3783-460e-a7a2-245e054dcbe3',
'name': 'Apple Laptop 1',
'model': {'id': '50f5993e-2abf-49c8-86e0-8743dd58db6f',
'name': 'MacBook Pro'},
'manufacturer': {'id': 'f56244e2-76e3-46da-97dd-f72f92ca0779',
'name': 'APPLE'},
'room': {'id': '700ff2dc-0118-46c6-936a-01f0fa88c620',
'name': 'Storage Room 1',
'thirdPartyId': ''},
'location': {'id': 'cf6707e3-f0ae-4040-a184-737b21a4bbd1',
'name': 'Iron Mountain',
'thirdPartyId': ''},
'position': 'NonMounted',
'containerAsset': {'id': '00000000-0000-0000-0000-000000000000',
'name': None},
'baseAsset': {'id': '064ca857-3783-460e-a7a2-245e054dcbe3',
'name': 'Apple Laptop 1'},
'description': None,
'status': {'id': 'df9906d8-2856-45e3-9cba-bd7a1ac4971f',
'name': 'Production'},
'serialNumber': '5QBYGKUI06',
'tagNumber': None,
'alternateTagNumber': None,
'verificationStatus': {'id': 'cb3560a9-eef5-47b9-b033-394d3a09db18',
'name': 'Verified'},
'requiresRFID': False,
'requiresHangTag': False,
'bottomPosition': 0.0,
'leftPosition': 0.0,
'rackPosition': 'Front',
'labelX': None,
'labelY': None,
'verifyNameInRear': False,
'verifySerialNumberInRear': False,
'verifyBarcodeInRear': False,
'isNonDataCenter': False,
'rotate': False,
'customer': {'id': '00000000-0000-0000-0000-000000000000', 'name': None},
'thirdPartyId': '',
'temperature': None,
'dateLastScanned': None,
'placement': 'Floor',
'lastScannedLabelX': None,
'lastScannedLabelY': None,
'userDefinedValues': [{'userDefinedKeyId': '79e77a1e-4030-4308-a8ff-9caf40c04fbd',
'userDefinedKeyName': 'Longitude ',
'value': '-75.208917'},
{'userDefinedKeyId': '72c8056e-9b7d-40ac-9270-9f5929097e82',
'userDefinedKeyName': 'Address',
'value': '800 Laurel Rd, New York ,NY 19050, USA'},
{'userDefinedKeyId': '31aeeb91-daef-4364-8dd6-b0e3436d6a51',
'userDefinedKeyName': 'Battery Level',
'value': '67'},
{'userDefinedKeyId': '22b7ce4f-7d3d-4282-9ecb-e8ec2238acf2',
'userDefinedKeyName': 'Latitude',
'value': '35.932179'}]}
The documentation provided by API 2 tells me they only support PUT for updates as of right now but I would also want to know how I would do this using PATCH as it will be available in the future. So the data payload that I need to successful PUT is this
payload = {'id': '064ca857-3783-460e-a7a2-245e054dcbe3',
'name': 'Apple Laptop 1',
'model': {'id': '50f5993e-2abf-49c8-86e0-8743dd58db6f',
'name': 'MacBook Pro'},
'manufacturer': {'id': 'f56244e2-76e3-46da-97dd-f72f92ca0779',
'name': 'APPLE'},
'room': {'id': '700ff2dc-0118-46c6-936a-01f0fa88c620',
'name': 'Storage Room 1',
'thirdPartyId': ''},
'status': {'id': 'df9906d8-2856-45e3-9cba-bd7a1ac4971f',
'name': 'Production'},
'serialNumber': '5QBYGKUI06',
'verificationStatus': {'id': 'cb3560a9-eef5-47b9-b033-394d3a09db18',
'name': 'Verified'},
'requiresRFID': 'False',
'requiresHangTag': 'False',
'userDefinedValues': [{'userDefinedKeyId': '79e77a1e-4030-4308-a8ff-9caf40c04fbd',
'userDefinedKeyName': 'Longitude ',
'value': '-75.248920'},
{'userDefinedKeyId': '72c8056e-9b7d-40ac-9270-9f5929097e82',
'userDefinedKeyName': 'Address',
'value': '801 Laurel Rd, New York, Ny 192250, USA'},
{'userDefinedKeyId': '31aeeb91-daef-4364-8dd6-b0e3436d6a51',
'userDefinedKeyName': 'Battery Level',
'value': '67'},
{'userDefinedKeyId': '22b7ce4f-7d3d-4282-9ecb-e8ec2238acf2',
'userDefinedKeyName': 'Latitude',
'value': '29.782177'}]}
So apart of this is figuring out how I can query the json data portions that I need for the update
I am able to update the information using this line
requests.put('https://api2.com/api/assets/064ca857-3783-460e-a7a2-245e054dcbe3',auth=('API2', '123456'), data=json.dumps(payload))
but I need for it to dynamically update so I don't think the hard coded id parameter in the line will be efficient in a automation/efficiency standpoint. If anybody has any ideas, resources to point me in the right direction to know more about this process (I don't really know what it is even called) would be greatly appreciated.

Not entirely sure what you are trying to do here, but if you want to pull information nested in the responses you can do this.
Serial number from API 1
r['device']['serial_no']
Serial number for API 2
either r[0]['serialNumber'] or r['items'][0]['serialNumber'] depending on what you are showing
To modify the payload serial number, for example
payload['serialNumber'] = '123456abcdef'

Python Generators and how to iterate over correctly to drop records based on a key within the dictionary being present in a a separate list

I'm new to the concept of generators and I'm struggling with how to apply my changes to the records within the generator object returned from the RISparser module.
I understand that a generator only reads a record at a time and doesn't actually store the data in memory but I'm having a tough time iterating over it effectively and applying my changes.
My changes will involve dropping records that have not got ['doi'] values that are contained within a list of DOIs [doi_match].
doi_match = ['10.1002/14651858.CD008259.pub2','10.1002/14651858.CD011552','10.1002/14651858.CD011990']
Generator object returned form RISparser contains the following information, this is just the first 2 records returned of a few 100. I want to iterate over it and compare the 'doi': key from the generator with the list of DOIs.
{'type_of_reference': 'JOUR', 'title': "The CoRe Outcomes in WomeN's health (CROWN) initiative: Journal editors invite researchers to develop core outcomes in women's health", 'secondary_title': 'Neurourology and Urodynamics', 'alternate_title1': 'Neurourol. Urodyn.', 'volume': '33', 'number': '8', 'start_page': '1176', 'end_page': '1177', 'year': '2014', 'doi': '10.1002/nau.22674', 'issn': '07332467 (ISSN)', 'authors': ['Khan, K.'], 'keywords': ['Bias (epidemiology)', 'Clinical trials', 'Consensus', 'Endpoint determination/standards', 'Evidence-based medicine', 'Guidelines', 'Research design/standards', 'Systematic reviews', 'Treatment outcome', 'consensus', 'editor', 'female', 'human', 'medical literature', 'Note', 'outcomes research', 'peer review', 'randomized controlled trial (topic)', 'systematic review (topic)', "women's health", 'outcome assessment', 'personnel', 'publication', 'Female', 'Humans', 'Outcome Assessment (Health Care)', 'Periodicals as Topic', 'Research Personnel', "Women's Health"], 'publisher': 'John Wiley and Sons Inc.', 'notes': ['Export Date: 14 July 2020', 'CODEN: NEURE'], 'type_of_work': 'Note', 'name_of_database': 'Scopus', 'custom2': '25270392', 'language': 'English', 'url': 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908368202&doi=10.1002%2fnau.22674&partnerID=40&md5=b220702e005430b637ef9d80a94dadc4'}
{'type_of_reference': 'JOUR', 'title': "The CROWN initiative: Journal editors invite researchers to develop core outcomes in women's health", 'secondary_title': 'Gynecologic Oncology', 'alternate_title1': 'Gynecol. Oncol.', 'volume': '134', 'number': '3', 'start_page': '443', 'end_page': '444', 'year': '2014', 'doi': '10.1016/j.ygyno.2014.05.005', 'issn': '00908258 (ISSN)', 'authors': ['Karlan, B.Y.'], 'author_address': 'Gynecologic Oncology and Gynecologic Oncology Reports, India', 'keywords': ['clinical trial (topic)', 'decision making', 'Editorial', 'evidence based practice', 'female infertility', 'health care personnel', 'human', 'outcome assessment', 'outcomes research', 'peer review', 'practice guideline', 'premature labor', 'priority journal', 'publication', 'systematic review (topic)', "women's health", 'editorial', 'female', 'outcome assessment', 'personnel', 'publication', 'Female', 'Humans', 'Outcome Assessment (Health Care)', 'Periodicals as Topic', 'Research Personnel', "Women's Health"], 'publisher': 'Academic Press Inc.', 'notes': ['Export Date: 14 July 2020', 'CODEN: GYNOA', 'Correspondence Address: Karlan, B.Y.; Gynecologic Oncology and Gynecologic Oncology ReportsIndia'], 'type_of_work': 'Editorial', 'name_of_database': 'Scopus', 'custom2': '25199578', 'language': 'English', 'url': 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908351159&doi=10.1016%2fj.ygyno.2014.05.005&partnerID=40&md5=ab5a4d26d52c12d081e38364b0c79678'}
I tried iterating over the generator and applying the changes. But the records that have matches are not being placed in the match list.
match = []
for entry in ris_records:
if entry['doi'] in doi_match:
match.append(entry)
else:
del entry
any advice on how to iterate over a generator correctly, thanks.

Trying to generate a list from a JSON object (TypeError list indices must be integers or slices, not str)

I have retrieved a JSON object from an API. The JSON object looks like this:
{'copyright': 'Copyright (c) 2020 The New York Times Company. All Rights '
'Reserved.',
'response': {'docs': [{'_id': 'nyt://article/e3e5e5e5-1b32-5e2b-aea7-cf20c558dbd3',
'abstract': 'LEAD: RESEARCHERS at the Brookhaven '
'National Laboratory are employing a novel '
'model to study skin cancer in humans: '
'they are exposing tiny tropical fish to '
'ultraviolet radiation.',
'byline': {'organization': None,
'original': 'By Eric Schmitt',
'person': [{'firstname': 'Eric',
'lastname': 'Schmitt',
'middlename': None,
'organization': '',
'qualifier': None,
'rank': 1,
'role': 'reported',
'title': None}]},
'document_type': 'article',
'headline': {'content_kicker': None,
'kicker': None,
'main': 'Tiny Fish Help Solve Cancer '
'Riddle',
'name': None,
'print_headline': 'Tiny Fish Help Solve '
'Cancer Riddle',
'seo': None,
'sub': None},
'keywords': [{'major': 'N',
'name': 'organizations',
'rank': 1,
'value': 'Brookhaven National '
'Laboratory'},
{'major': 'N',
'name': 'subject',
'rank': 2,
'value': 'Ozone'},
{'major': 'N',
'name': 'subject',
'rank': 3,
'value': 'Radiation'},
{'major': 'N',
'name': 'subject',
'rank': 4,
'value': 'Cancer'},
{'major': 'N',
'name': 'subject',
'rank': 5,
'value': 'Research'},
{'major': 'N',
'name': 'subject',
'rank': 6,
'value': 'Fish and Other Marine Life'}],
'lead_paragraph': 'RESEARCHERS at the Brookhaven '
'National Laboratory are employing a '
'novel model to study skin cancer in '
'humans: they are exposing tiny '
'tropical fish to ultraviolet '
'radiation.',
'multimedia': [],
'news_desk': 'Science Desk',
'print_page': '3',
'print_section': 'C',
'pub_date': '1989-12-26T05:00:00+0000',
'section_name': 'Science',
'snippet': '',
'source': 'The New York Times',
'type_of_material': 'News',
'uri': 'nyt://article/e3e5e5e5-1b32-5e2b-aea7-cf20c558dbd3',
'web_url': 'https://www.nytimes.com/1989/12/26/science/tiny-fish-help-solve-cancer-riddle.html',
'word_count': 870},
{'_id': 'nyt://article/32a2431d-623a-525b-a21d-d401be865818',
'abstract': 'LEAD: Clouds, even the ones formed by '
...and continues like that, too long to show all of it here.
Now, when I want to list just one headline, I use:
pprint(articles['response']['docs'][0]['headline']['print_headline'])
And I get the output
'Tiny Fish Help Solve Cancer Riddle'
The problem is when I want to pick out all of the headlines from this JSON object, and make a list of them. I tried:
index = 0
for headline in articles:
headlineslist = ['response']['docs'][index]['headline']['print_headline'].split("''")
index = index + 1
headlineslist
But I get the error TypeError: list indices must be integers or slices, not str
In other words, it worked when I "listed" just one headline, at index [0], but not when I try to repeat the process over each index. How do I iterate through each index to get a list of outputs like the first one?

To iterate over the document list you can just do the following:
for doc in (articles['response']['docs']):
print(doc['headline']['print_headline'])
This would print all headlines.

API printing in terminal but not in HTML template

I am importing news API in my Django project. I can print my data in my terminal however I can't print through my news.html file. This could be an issue related to importing the data in HTML.
from django.shortcuts import render
import requests
def news(request):
url = ('https://newsapi.org/v2/top-headlines?'
'sources=bbc-news&'
'apiKey=647505e4506e425994ac0dc310221d04')
response = requests.get(url)
print(response.json())
news = response.json()
return render(request,'new/new.html',{'news':news})
base.html
<html>
<head>
<title></title>
</head>
<body>
{% block content %}
{% endblock %}
</body>
</html>
news.html
{% extends 'base.html' %}
{% block content %}
<h2>news API</h2>
{% if news %}
<p><strong>{{ news.title }}</strong><strong>{{ news.name}}</strong> public repositories.</p>
{% endif %}
{% endblock %}
Terminal and API Output
System check identified no issues (0 silenced).
November 28, 2018 - 12:31:07
Django version 2.1.3, using settings 'map.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
{'status': 'ok', 'totalResults': 10, 'articles': [{'source':
{'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News',
'title': 'Sri Lanka defence chief held over murders',
'description': "The country's top officer is in custody, accused of covering up illegal killings in the civil war.", 'url': 'http://www.bbc.co.uk/news/world-asia-46374111', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/1010/production/_104521140_26571c51-e151-41b9-85a3-d6e441f5262b.jpg', 'publishedAt': '2018-11-28T12:12:05Z', 'content': "Image copyright AFP Image caption Adm Wijeguneratne denies the charges Sri Lanka's top military officer has been remanded in custody, accused of covering up civil war-era murders. Chief of Defence Staff Ravindra Wijeguneratne appeared in court after warrants … [+288 chars]"}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Flash-flooding causes chaos in Sydney', 'description': "Emergency crews respond to hundreds of calls on the city's wettest November day since 1984.", 'url': 'http://www.bbc.co.uk/news/world-australia-46366961', 'urlToImage': 'https://ichef.bbci.co.uk/images/ic/1024x576/p06t1d6h.jpg', 'publishedAt': '2018-11-28T11:58:49Z', 'content': 'Media caption People in vehicles were among those caught up in the floods Sydney has been deluged by the heaviest November rain it has experienced in decades, causing flash-flooding, traffic chaos and power cuts. Heavy rain fell throughout Wednesday, the city… [+2282 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': "Rapist 'gets chance to see victim's child'", 'description': 'Sammy Woodhouse calls for a law change after rapist Arshid Hussain is given the chance to see his son.', 'url': 'http://www.bbc.co.uk/news/uk-england-south-yorkshire-46368991', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/12C94/production/_95184967_jessica.jpg', 'publishedAt': '2018-11-28T09:38:07Z', 'content': "Image caption Sammy Woodhouse's son was conceived when she was raped by Arshid Hussain A child exploitation scandal victim has called for a law change amid claims a man who raped her has been invited to play a role in her son's life. Arshid Hussain, who was j… [+2543 chars]"}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'China chemical plant explosion kills 22', 'description': 'Initial reports say a vehicle carrying chemicals exploded while waiting to enter the north China plant.', 'url': 'http://www.bbc.co.uk/news/world-asia-46369041', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/2E1A/production/_104520811_mediaitem104520808.jpg', 'publishedAt': '2018-11-28T08:03:12Z', 'content': 'Image copyright AFP Image caption A line of burnt out vehicles could be seen outside the chemical plant At least 22 people have died and 22 more were injured in a blast outside a chemical factory in northern China. A vehicle carrying chemicals exploded while … [+1252 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Thousands told to flee Australia bushfire', 'description': 'Queensland\'s fire danger warning has been raised to "catastrophic" for the first time.', 'url': 'http://www.bbc.co.uk/news/world-australia-46366964', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/8977/production/_104519153_1ccd493b-4500-4d8d-9e6c-f32ba036dd3e.jpg', 'publishedAt': '2018-11-28T07:01:41Z', 'content': 'Image copyright EPA Image caption More than 130 bushfires are burning across Queensland, officials say Thousands of Australians have been told to evacuate their homes as a powerful bushfire threatens properties in Queensland. It follows the raising of the sta… [+974 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': "Chinese scientist defends 'gene-editing'", 'description': "He Jiankui shocked the world by claiming he had created the world's first genetically edited children.", 'url': 'http://www.bbc.co.uk/news/world-asia-china-46368731', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/7A23/production/_97176213_breaking_news_bigger.png', 'publishedAt': '2018-11-28T06:00:22Z', 'content': 'A Chinese scientist who claims to have created the world\'s first genetically edited babies has defended his work. Speaking at a genome summit in Hong Kong, He Jiankui, an associate professor at a Shenzhen university, said he was "proud" of his work. He said "… [+335 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Republican wins Mississippi Senate seat', 'description': "Cindy Hyde-Smith wins Mississippi's Senate race in a vote overshadowed by racial acrimony.", 'url': 'http://www.bbc.co.uk/news/world-us-canada-46361369', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/3A2B/production/_104519841_050855280.jpg', 'publishedAt': '2018-11-28T04:19:15Z', 'content': "Image copyright Reuters Image caption In her victory speech, Cindy Hyde-Smith promised to represent all Mississippians Republican Cindy Hyde-Smith has won Mississippi's racially charged Senate election, beating a challenge from the black Democrat, Mike Espy. … [+4327 chars]"}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': "Lion Air should 'improve safety culture'", 'description': 'Indonesian authorities release a preliminary report into a crash in October that killed 189 people.', 'url': 'http://www.bbc.co.uk/news/world-asia-46121127', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/1762F/production/_104519759_45e74e27-2dc6-45dc-bded-405c057702f5.jpg', 'publishedAt': '2018-11-28T04:10:45Z', 'content': "Image copyright Reuters Image caption The families of the victims visited the site of the crash to pay tribute Indonesian authorities have recommended that budget airline Lion Air improve its safety culture, in a preliminary report into last month's deadly cr… [+1725 chars]"}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': "Trump 'may cancel Putin talks over Ukraine'", 'description': '"I don\'t like the aggression," the US leader says after Russia seizes Ukrainian boats off Crimea.', 'url': 'http://www.bbc.co.uk/news/world-europe-46367191', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/0C77/production/_104519130_050842389.jpg', 'publishedAt': '2018-11-28T01:08:30Z', 'content': 'Image copyright AFP Image caption Some of the detained Ukrainians have appeared in court in Crimea US President Donald Trump says he may cancel a meeting with Russian President Vladimir Putin following a maritime clash between Russia and Ukraine. Mr Trump tol… [+4595 chars]'}, {'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': 'BBC News', 'title': 'Wandering dog home after 2,200-mile adventure', 'description': "Sinatra the husky was found in Florida 18 months after vanishing in New York. Here's how he got home.", 'url': 'http://www.bbc.co.uk/news/world-us-canada-46353240', 'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/D49E/production/_104503445_p06t0kn9.jpg', 'publishedAt': '2018-11-27T21:47:59Z', 'content': "Video Sinatra the husky was found in Florida 18 months after vanishing in New York. Here's the remarkable story of how he got home."}]}
[28/Nov/2018 12:31:12] "GET / HTTP/1.1" 200 155

The data you get from that API doesn't have title or name as attributes at the top level. Rather, they are inside the articles element, which itself is a list.
{% for article in news.articles %}
<p><strong>{{ article.title }}</strong><strong>{{ article.source.name }}</strong> public repositories.</p>
{% endif %}

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraping profiles with Python and the "scrape-linkedin" package - python

Related

Accessing a webpage within a webpage using BeautifulSoup?

How to Match two APIs to update one API dataset using Python

Python Generators and how to iterate over correctly to drop records based on a key within the dictionary being present in a a separate list

Trying to generate a list from a JSON object (TypeError list indices must be integers or slices, not str)

API printing in terminal but not in HTML template

Categories

Resources