I want to scrape multiple Google Scholar user profiles - publications, journals, citations, etc. I have already written the Python code for scraping a single profile given its URL. Now suppose I have 100 names and the corresponding URLs in an Excel file like this:
name link
Autor https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en
Dorn https://scholar.google.com/citations?user=w3Dri00AAAAJ&hl=en
Hanson https://scholar.google.com/citations?user=nMtHiQsAAAAJ&hl=en
Borjas https://scholar.google.com/citations?user=Patm-BEAAAAJ&hl=en
....
My question is: can I read the 'link' column of this file and write a for loop over the URLs, so that I can scrape each of these profiles and append the results to the same file? It seems a bit far-fetched, but I hope there is a way to do so. Thanks in advance!
You can use pandas.read_csv() to read a CSV file into a DataFrame and pull out a specific column. For example:
import pandas as pd

df = pd.read_csv('data.csv')
arr = []
link_col = df['link']
for i in link_col:
    arr.append(i)
print(arr)
This would allow you to extract only the link column and append each value into your array. If you'd like to learn more, you can refer to the pandas documentation.
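As a side note, pandas can also hand you the whole column as a plain list in one call, which makes the explicit loop optional:

links = pd.read_csv('data.csv')['link'].tolist()
print(links)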
I hope it is not too advanced for you
1 Create a class for your pages:
class Page:
    def __init__(self, name=None, link=None):
        self.name = name
        self.link = link
2 Create pages list
pages = []
3 Find rows locator, like:
rows = driver.find_elements_by_css_selector("your_selector")
The rows count must be the same as the number of rows in your table. For example, if you have 20 items in the list, the rows count will be 20.
4 Get each row value:
for row in rows:
    name = row.find_element_by_css_selector("unique selector for the name field").text
    link = row.find_element_by_css_selector("unique selector for the link field").text
5 Create a page object:
page = Page(name=name, link=link)
6 Append it to the list:
pages.append(page)
Result
A list of Page objects, where the first row is accessible with pages[0], the second with pages[1], and so on.
P.S.
If you are having trouble with selectors, ask about them in separate questions.
I think I have explained the concept, so you should be able to start.
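Putting the steps together, a minimal sketch could look like this (the URL and CSS selectors are placeholders you would replace with your own):

from selenium import webdriver

class Page:
    def __init__(self, name=None, link=None):
        self.name = name
        self.link = link

driver = webdriver.Chrome()
driver.get("https://example.com/your-table-page")  # placeholder URL

pages = []
rows = driver.find_elements_by_css_selector("your_row_selector")  # placeholder selector
for row in rows:
    # placeholder selectors for the name and link cells of each row
    name = row.find_element_by_css_selector("your_name_selector").text
    link = row.find_element_by_css_selector("your_link_selector").text
    pages.append(Page(name=name, link=link))

print(pages[0].name, pages[0].link)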
To read data from an Excel file you can use pandas read_excel() method like so:
# https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
authors_df = pd.read_excel("google_scholar_scrape_multiple_authors.xlsx", sheet_name="authors") # sheet_name is optional in this case
print(authors_df["author_link"])
'''
0 https://scholar.google.com/citations?hl=en&use...
1 https://scholar.google.com/citations?hl=en&use...
2 https://scholar.google.com/citations?hl=en&use...
3 https://scholar.google.com/citations?hl=en&use...
4 https://scholar.google.com/citations?hl=en&use...
Name: author_link, dtype: object
'''
print(authors_df)
'''
author_name author_link
0 Masatoshi Nei https://scholar.google.com/citations?hl=en&use...
1 Ulrich K. Laemmli https://scholar.google.com/citations?hl=en&use...
2 Gene Myers https://scholar.google.com/citations?hl=en&use...
3 Sudhir Kumar https://scholar.google.com/citations?hl=en&use...
4 Irving Weissman https://scholar.google.com/citations?hl=en&use...
'''
To scrape multiple authors you can use a for loop to iterate over the ["author_link"] column and extract the desired data using the beautifulsoup, lxml, and requests libraries.
Code and full example:
from bs4 import BeautifulSoup
import requests, lxml
import pandas as pd

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}

# https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
authors_df = pd.read_excel("google_scholar_scrape_multiple_authors.xlsx", sheet_name="authors")  # sheet_name is optional in this case

# to_list() returns a list of author links so we can iterate over them
for author_link in authors_df["author_link"].to_list():
    html = requests.get(author_link, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    print(f"Currently extracting: {soup.select_one('#gsc_prf_in').text}")

    author_email = soup.select_one("#gsc_prf_ivh").text
    author_image = f'https://scholar.google.com{soup.select_one("#gsc_prf_pup-img")["src"]}'
    print(author_image, f"Author email: {author_email}", sep="\n")

    # iterating over the container with all needed data by accessing the right CSS selector
    # have a look at the SelectorGadget Chrome extension to easily grab CSS selectors
    for article in soup.select("#gsc_a_b .gsc_a_t"):
        article_title = article.select_one(".gsc_a_at").text
        article_link = f'https://scholar.google.com{article.select_one(".gsc_a_at")["href"]}'
        article_authors = article.select_one(".gsc_a_at+ .gs_gray").text
        article_publication = article.select_one(".gs_gray+ .gs_gray").text
        print(article_title, article_link, article_authors, article_publication, sep="\n")

    print("-" * 15)
# part of the output:
'''
Currently extracting: Masatoshi Nei
https://scholar.google.comhttps://scholar.googleusercontent.com/citations?view_op=view_photo&user=VxOmZDgAAAAJ&citpid=3
Author email: Verified email at temple.edu
The neighbor-joining method: a new method for reconstructing phylogenetic trees.
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:u5HHmVD_uO8C
N Saitou, M Nei
Molecular biology and evolution 4 (4), 406-425, 1987
... other results
---------------
Currently extracting: Irving Weissman
https://scholar.google.com/citations/images/avatar_scholar_128.png
Author email: Verified email at stanford.edu
Stem cells, cancer, and cancer stem cells
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Y66bJgUAAAAJ&citation_for_view=Y66bJgUAAAAJ:u5HHmVD_uO8C
T Reya, SJ Morrison, MF Clarke, IL Weissman
nature 414 (6859), 105-111, 2001
'''
Alternatively, you can achieve the same thing using the Google Scholar Author API from SerpApi. It's a paid API with a free plan.
Essentially, you only need to grab the data you want from the returned dictionary, without needing to figure out which selectors to use to scrape the data properly, how to bypass blocks from Google, or how to increase the number of requests.
Example code to integrate:
import re
import pandas as pd
from serpapi import GoogleSearch

authors_df = pd.read_excel("google_scholar_scrape_multiple_authors.xlsx", sheet_name="authors")  # sheet_name is optional in this case

for author in authors_df["author_link"].to_list():
    params = {
        "api_key": "YOUR_API_KEY",
        "engine": "google_scholar_author",
        "hl": "en",
        # using a basic regular expression to grab the user ID from the passed URL
        "author_id": re.search(r"user=(.*)", author).group(1)  # -> VxOmZDgAAAAJ, unique author ID from the URL
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(f"Extracting data from: {results['author']['name']}\n"
          f"Author info: {results['author']}\n\n"
          f"Author articles:\n{results['articles']}\n")
# part of the output:
'''
Extracting data from: Masatoshi Nei
Author info: {'name': 'Masatoshi Nei', 'affiliations': 'Laura Carnell Professor of Biology, Temple University', 'email': 'Verified email at temple.edu', 'interests': [{'title': 'Evolution', 'link': 'https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:evolution', 'serpapi_link': 'https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aevolution'}, {'title': 'Evolutionary biology', 'link': 'https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:evolutionary_biology', 'serpapi_link': 'https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aevolutionary_biology'}, {'title': 'Molecular evolution', 'link': 'https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:molecular_evolution', 'serpapi_link': 'https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Amolecular_evolution'}, {'title': 'Population genetics', 'link': 'https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:population_genetics', 'serpapi_link': 'https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Apopulation_genetics'}, {'title': 'Phylogenetics', 'link': 'https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:phylogenetics', 'serpapi_link': 'https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aphylogenetics'}], 'thumbnail': 'https://scholar.googleusercontent.com/citations?view_op=view_photo&user=VxOmZDgAAAAJ&citpid=3'}
Author articles:
[{'title': 'The neighbor-joining method: a new method for reconstructing phylogenetic trees.', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:u5HHmVD_uO8C', 'citation_id': 'VxOmZDgAAAAJ:u5HHmVD_uO8C', 'authors': 'N Saitou, M Nei', 'publication': 'Molecular biology and evolution 4 (4), 406-425, 1987', 'cited_by': {'value': 64841, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=7672721046330422437,346314157833338191', 'serpapi_link': 'https://serpapi.com/search.json?cites=7672721046330422437%2C346314157833338191&engine=google_scholar&hl=en', 'cites_id': '7672721046330422437,346314157833338191'}, 'year': '1987'}, {'title': 'MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:Tyk-4Ss8FVUC', 'citation_id': 'VxOmZDgAAAAJ:Tyk-4Ss8FVUC', 'authors': 'K Tamura, D Peterson, N Peterson, G Stecher, M Nei, S Kumar', 'publication': 'Molecular biology and evolution 28 (10), 2731-2739, 2011', 'cited_by': {'value': 44316, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=5624029996178252455,5910675136328950108,13537318717249213469', 'serpapi_link': 'https://serpapi.com/search.json?cites=5624029996178252455%2C5910675136328950108%2C13537318717249213469&engine=google_scholar&hl=en', 'cites_id': '5624029996178252455,5910675136328950108,13537318717249213469'}, 'year': '2011'}, {'title': 'MEGA6: molecular evolutionary genetics analysis version 6.0', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:qmtmRrLr0tkC', 'citation_id': 'VxOmZDgAAAAJ:qmtmRrLr0tkC', 'authors': 'K Tamura, G Stecher, D Peterson, A Filipski, S Kumar', 'publication': 'Molecular biology and evolution 30 (12), 2725-2729, 2013', 'cited_by': {'value': 40558, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=5258359186493639031', 'serpapi_link': 'https://serpapi.com/search.json?cites=5258359186493639031&engine=google_scholar&hl=en', 'cites_id': '5258359186493639031'}, 'year': '2013'}, {'title': 'MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:u-x6o8ySG0sC', 'citation_id': 'VxOmZDgAAAAJ:u-x6o8ySG0sC', 'authors': 'K Tamura, J Dudley, M Nei, S Kumar', 'publication': 'Molecular biology and evolution 24 (8), 1596-1599, 2007', 'cited_by': {'value': 34245, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=8480751610153565117', 'serpapi_link': 'https://serpapi.com/search.json?cites=8480751610153565117&engine=google_scholar&hl=en', 'cites_id': '8480751610153565117'}, 'year': '2007'}, {'title': 'Molecular evolutionary genetics', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:d1gkVwhDpl0C', 'citation_id': 'VxOmZDgAAAAJ:d1gkVwhDpl0C', 'authors': 'M Nei', 'publication': 'Columbia university press, 1987', 'cited_by': {'value': 20704, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=7660515423132980153', 'serpapi_link': 'https://serpapi.com/search.json?cites=7660515423132980153&engine=google_scholar&hl=en', 'cites_id': '7660515423132980153'}, 'year': '1987'}, {'title': 'MEGA2: molecular evolutionary genetics analysis 
software', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:IjCSPb-OGe4C', 'citation_id': 'VxOmZDgAAAAJ:IjCSPb-OGe4C', 'authors': 'S Kumar, K Tamura, IB Jakobsen, M Nei', 'publication': 'Bioinformatics 17 (12), 1244-1245, 2001', 'cited_by': {'value': 16078, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=14171206204658643394,531426008085525562,5869149036159079676,8067244568899724142,12929609819447339488,15783386726452728786', 'serpapi_link': 'https://serpapi.com/search.json?cites=14171206204658643394%2C531426008085525562%2C5869149036159079676%2C8067244568899724142%2C12929609819447339488%2C15783386726452728786&engine=google_scholar&hl=en', 'cites_id': '14171206204658643394,531426008085525562,5869149036159079676,8067244568899724142,12929609819447339488,15783386726452728786'}, 'year': '2001'}, {'title': 'Estimation of average heterozygosity and genetic distance from a small number of individuals', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:2osOgNQ5qMEC', 'citation_id': 'VxOmZDgAAAAJ:2osOgNQ5qMEC', 'authors': 'M Nei', 'publication': 'Genetics 89 (3), 583-590, 1978', 'cited_by': {'value': 14504, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=11038674224870321151', 'serpapi_link': 'https://serpapi.com/search.json?cites=11038674224870321151&engine=google_scholar&hl=en', 'cites_id': '11038674224870321151'}, 'year': '1978'}, {'title': 'MEGA3: integrated software for molecular evolutionary genetics analysis and sequence alignment', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:9yKSN-GCB0IC', 'citation_id': 'VxOmZDgAAAAJ:9yKSN-GCB0IC', 'authors': 'S Kumar, K Tamura, M Nei', 'publication': 'Briefings in bioinformatics 5 (2), 150-163, 2004', 'cited_by': {'value': 14403, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=10013295782066828040,15148316572039251274', 'serpapi_link': 'https://serpapi.com/search.json?cites=10013295782066828040%2C15148316572039251274&engine=google_scholar&hl=en', 'cites_id': '10013295782066828040,15148316572039251274'}, 'year': '2004'}, {'title': 'Mathematical model for studying genetic variation in terms of restriction endonucleases', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:qjMakFHDy7sC', 'citation_id': 'VxOmZDgAAAAJ:qjMakFHDy7sC', 'authors': 'M Nei, WH Li', 'publication': 'Proceedings of the National Academy of Sciences 76 (10), 5269-5273, 1979', 'cited_by': {'value': 13619, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=5179626164554275201,1942230974501958280', 'serpapi_link': 'https://serpapi.com/search.json?cites=5179626164554275201%2C1942230974501958280&engine=google_scholar&hl=en', 'cites_id': '5179626164554275201,1942230974501958280'}, 'year': '1979'}, {'title': 'Genetic distance between populations', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:UeHWp8X0CEIC', 'citation_id': 'VxOmZDgAAAAJ:UeHWp8X0CEIC', 'authors': 'M Nei', 'publication': 'The American Naturalist 106 (949), 283-292, 1972', 'cited_by': {'value': 12980, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=4154924214026252226,7115074001272219295', 'serpapi_link': 
'https://serpapi.com/search.json?cites=4154924214026252226%2C7115074001272219295&engine=google_scholar&hl=en', 'cites_id': '4154924214026252226,7115074001272219295'}, 'year': '1972'}, {'title': 'Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees.', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:Y0pCki6q_DkC', 'citation_id': 'VxOmZDgAAAAJ:Y0pCki6q_DkC', 'authors': 'K Tamura, M Nei', 'publication': 'Molecular biology and evolution 10 (3), 512-526, 1993', 'cited_by': {'value': 11093, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=13509507708085673250', 'serpapi_link': 'https://serpapi.com/search.json?cites=13509507708085673250&engine=google_scholar&hl=en', 'cites_id': '13509507708085673250'}, 'year': '1993'}, {'title': 'Analysis of gene diversity in subdivided populations', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:zYLM7Y9cAGgC', 'citation_id': 'VxOmZDgAAAAJ:zYLM7Y9cAGgC', 'authors': 'M Nei', 'publication': 'Proceedings of the national academy of sciences 70 (12), 3321-3323, 1973', 'cited_by': {'value': 10714, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=11712109356391350421', 'serpapi_link': 'https://serpapi.com/search.json?cites=11712109356391350421&engine=google_scholar&hl=en', 'cites_id': '11712109356391350421'}, 'year': '1973'}, {'title': 'Molecular evolution and phylogenetics', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:YsMSGLbcyi4C', 'citation_id': 'VxOmZDgAAAAJ:YsMSGLbcyi4C', 'authors': 'M Nei, S Kumar', 'publication': 'Oxford University Press, USA, 2000', 'cited_by': {'value': 8795, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=703217195301701212,1351927951694611906', 'serpapi_link': 'https://serpapi.com/search.json?cites=703217195301701212%2C1351927951694611906&engine=google_scholar&hl=en', 'cites_id': '703217195301701212,1351927951694611906'}, 'year': '2000'}, {'title': 'Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions.', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:W7OEmFMy1HYC', 'citation_id': 'VxOmZDgAAAAJ:W7OEmFMy1HYC', 'authors': 'M Nei, T Gojobori', 'publication': 'Molecular biology and evolution 3 (5), 418-426, 1986', 'cited_by': {'value': 5279, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=12106160511321461626', 'serpapi_link': 'https://serpapi.com/search.json?cites=12106160511321461626&engine=google_scholar&hl=en', 'cites_id': '12106160511321461626'}, 'year': '1986'}, {'title': 'Prospects for inferring very large phylogenies by using the neighbor-joining method', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:0EnyYjriUFMC', 'citation_id': 'VxOmZDgAAAAJ:0EnyYjriUFMC', 'authors': 'K Tamura, M Nei, S Kumar', 'publication': 'Proceedings of the National Academy of Sciences 101 (30), 11030-11035, 2004', 'cited_by': {'value': 4882, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=9650987578903829104', 'serpapi_link': 'https://serpapi.com/search.json?cites=9650987578903829104&engine=google_scholar&hl=en', 'cites_id': '9650987578903829104'}, 'year': '2004'}, {'title': 'The bottleneck effect and 
genetic variability in populations', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:eQOLeE2rZwMC', 'citation_id': 'VxOmZDgAAAAJ:eQOLeE2rZwMC', 'authors': 'M Nei, T Maruyama, R Chakraborty', 'publication': 'Evolution, 1-10, 1975', 'cited_by': {'value': 3906, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=13149273985544466189', 'serpapi_link': 'https://serpapi.com/search.json?cites=13149273985544466189&engine=google_scholar&hl=en', 'cites_id': '13149273985544466189'}, 'year': '1975'}, {'title': 'Accuracy of estimated phylogenetic trees from molecular data', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:Se3iqnhoufwC', 'citation_id': 'VxOmZDgAAAAJ:Se3iqnhoufwC', 'authors': 'M Nei, F Tajima, Y Tateno', 'publication': 'Journal of molecular evolution 19 (2), 153-170, 1983', 'cited_by': {'value': 2877, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=10638180566709737898', 'serpapi_link': 'https://serpapi.com/search.json?cites=10638180566709737898&engine=google_scholar&hl=en', 'cites_id': '10638180566709737898'}, 'year': '1983'}, {'title': 'Molecular population genetics and evolution.', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:WF5omc3nYNoC', 'citation_id': 'VxOmZDgAAAAJ:WF5omc3nYNoC', 'authors': 'M Nei', 'publication': 'Molecular population genetics and evolution., 1975', 'cited_by': {'value': 2795, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=7886550802885580479', 'serpapi_link': 'https://serpapi.com/search.json?cites=7886550802885580479&engine=google_scholar&hl=en', 'cites_id': '7886550802885580479'}, 'year': '1975'}, {'title': 'Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:roLk4NBRz8UC', 'citation_id': 'VxOmZDgAAAAJ:roLk4NBRz8UC', 'authors': 'AL Hughes, M Nei', 'publication': 'Nature 335 (6186), 167-170, 1988', 'cited_by': {'value': 2169, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=2966744676732650646', 'serpapi_link': 'https://serpapi.com/search.json?cites=2966744676732650646&engine=google_scholar&hl=en', 'cites_id': '2966744676732650646'}, 'year': '1988'}, {'title': 'Sampling variances of heterozygosity and genetic distance', 'link': 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=VxOmZDgAAAAJ&citation_for_view=VxOmZDgAAAAJ:UebtZRa9Y70C', 'citation_id': 'VxOmZDgAAAAJ:UebtZRa9Y70C', 'authors': 'M Nei, AK Roychoudhury', 'publication': 'Genetics 76 (2), 379-390, 1974', 'cited_by': {'value': 1918, 'link': 'https://scholar.google.com/scholar?oi=bibs&hl=en&cites=5978996318059495400', 'serpapi_link': 'https://serpapi.com/search.json?cites=5978996318059495400&engine=google_scholar&hl=en', 'cites_id': '5978996318059495400'}, 'year': '1974'}]
'''
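Since the original question also asked about appending the results back to the same Excel file, here is a minimal sketch with pandas. The scraped_emails list and the author_email column are assumptions for illustration; in practice you would collect whatever fields you scraped inside the loop above, one value per row:

import pandas as pd

authors_df = pd.read_excel("google_scholar_scrape_multiple_authors.xlsx", sheet_name="authors")

# assumption: one scraped value per dataframe row, collected inside the scraping loop
scraped_emails = ["Verified email at temple.edu"] * len(authors_df)  # placeholder data

authors_df["author_email"] = scraped_emails  # hypothetical new column
# note: to_excel() rewrites the file with just this sheet
authors_df.to_excel("google_scholar_scrape_multiple_authors.xlsx", sheet_name="authors", index=False)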
Disclaimer, I work for SerpApi.
Hi, I have written a script which uses BeautifulSoup4 to extract a list of jobs as well as their details and associated application links. I have used a separate for loop for each value (link/title/company etc.), as each piece of information is under a different class.
I have managed to write for loops to extract all of the data, but I am not sure how to pair the first result of the first for loop (link) with the first result of the second for loop (job title), and so on.
So my output is currently (there are 50 jobs in the search):
First 50 lines: links to the applications
Second 50 lines: names of each job title
etc.
import requests
import json
from bs4 import BeautifulSoup

URL = "https://remote.co/remote-jobs/developer/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

jobs = soup.find_all('a', class_='card m-0 border-left-0 border-right-0 border-top-0 border-bottom')
titles = soup.find_all('span', class_='font-weight-bold larger')
date_added = soup.find_all('span', class_='float-right d-none d-md-inline text-secondary')
company = soup.find_all('p', class_='m-0 text-secondary')
remote = 'https://remote.co/'

job_list = []
for a in jobs:
    link = a['href']
    print(f'Apply here: {remote}{link}')
    job_list.append(link)
for b in titles:
    job_list.append(b.text)
for c in date_added:
    job_list.append(c.text)
for d in company:
    job_list.append(d.text)
Here's the code I have written; can someone help me with organising it so that the first chunk of text will be:
Link to Apply
Job Title
Date the Job was Added
Name of Company and Working Hours
Here is a snippet of the HTML from the site
<div class="card bg-light mb-3 rounded-0">
<div class="card-body">
<div class="d-flex align-items-center mb-3">
<h2 class="text-uppercase mb-0 mr-2 raleway" style="-webkit-box-flex:0;flex-grow:0;">Remote Developer Jobs</h2><div style="background:#00a2e1;-webkit-box-flex:1;flex-grow:1;height:3px;"></div>
</div>
<div class="card bg-white m-0">
<div class="card-body p-0">
<p class="p-3 m-0 border-bottom">
<a href="/remote-jobs/" style="font-size:18px;">
<em>
See all Remote Jobs >
</em>
</a>
</p>
<a href="/job/staff-frontend-web-developer-24/" class="card m-0 border-left-0 border-right-0 border-top-0 border-bottom">
<div class="card border-0 p-3 job-card bg-white">
<div class="row no-gutters align-items-center">
<div class="col-lg-1 col-md-2 position-static d-none d-md-block pr-md-3">
<img src="data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E" alt="Routable" class="card-img" data-lazy-src="https://remoteco.s3.amazonaws.com/wp-content/uploads/2021/07/27194326/routable-150x150.png"/><noscript><img src="https://remoteco.s3.amazonaws.com/wp-content/uploads/2021/07/27194326/routable-150x150.png" alt="Routable" class="card-img"/></noscript>
</div>
<div class="col position-static">
<div class="card-body px-3 py-0 pl-md-0">
<p class="m-0"><span class="font-weight-bold larger">Staff Frontend Web Developer</span><span class="float-right d-none d-md-inline text-secondary"><small><date>1 day ago</date></small></span></p>
<p class="m-0 text-secondary">
Routable
| <span class="badge badge-success"><small>Full-time</small></span>
| <span class="badge badge-success"><small>International</small></span>
</p>
</div>
</div>
</div>
</div>
</a>
You can try the following example:
from bs4 import BeautifulSoup
import requests

page = requests.get('https://remote.co/remote-jobs/developer')
soup = BeautifulSoup(page.content, 'lxml')

data = []
for e in soup.select('div.card-body.p-0 > a'):
    soup2 = BeautifulSoup(requests.get('https://remote.co' + e.get('href')).content, 'lxml')
    d = {
        'title': soup2.h1.text,
        'job_name': soup2.select_one('div.job_description > p').text,
        'company': soup2.select_one('div.co_name > strong').text,
        'date': soup2.select_one('.date_sm time').text.replace('Posted:', ''),
        'Link': 'https://remote.co' + e.get('href')
    }
    data.append(d)
print(data)
Output:
[{'title': 'Principal Software Engineer at Wisetack', 'job_name': 'Principal Software Engineer', 'company': 'Wisetack', 'date': ' 2 hours ago', 'Link': 'https://remote.co/job/principal-software-engineer-26/'}, {'title': 'Staff Frontend Web Developer at Routable', 'job_name': 'Staff Frontend Web Developer', 'company': 'Routable', 'date': ' 1 day ago', 'Link': 'https://remote.co/job/staff-frontend-web-developer-24/'}, {'title': 'Developer Advocate at DeepSource', 'job_name': 'Developer Advocate', 'company': 'DeepSource', 'date': ' 2 days ago', 'Link': 'https://remote.co/job/developer-advocate-24/'}, {'title': 'Senior GCP DevOps Engineer at RXMG', 'job_name': 'Location:\xa0 US Locations Only; 100% Remote', 'company': 'RXMG', 'date': ' 3 days ago', 'Link': 'https://remote.co/job/senior-gcp-devops-engineer-23/'}, {'title': 'Growth Engineer, MarTech at Facet Wealth', 'job_name': 'Location:\xa0 US Locations Only; 100% Remote', 'company': 'Facet Wealth', 'date': ' 3 days ago', 'Link': 'https://remote.co/job/growth-engineer-martech-23/'}, {'title': 'DevOps Engineer at Oddball', 'job_name': 'Location:\xa0 US Locations Only; 100% Remote', 'company': 'Oddball', 'date': ' 3 days ago', 'Link': 'https://remote.co/job/devops-engineer-66/'}, {'title': 'DevOps Engineer at Paymentology', 'job_name': 'Location:\xa0 International, Anywhere; 100% remote', 'company': 'Paymentology', 'date': ' 4 days ago', 'Link': 'https://remote.co/job/devops-engineer-67/'}, {'title': 'Director, Core Technology Software Development at Andela', 'job_name': 'Title: Director, Core Technology Software Development', 'company': 'Andela', 'date': ' 4 days ago', 'Link': 'https://remote.co/job/director-core-technology-software-development-22/'}, {'title': 'Senior Developer – Net Core/C#/SQL (REMOTE or Local) at Cascade Financial Technology', 'job_name': 'Location:\xa0 US Locations Only; 100% Remote', 'company': 'Cascade Financial Technology', 'date': ' 4 days ago', 'Link': 'https://remote.co/job/senior-developer-net-core-c-sql-remote-or-local-22/'}, {'title': 'Front End Android Developer at Cascade Financial Technology', 'job_name': 'Location:\xa0 International, Anywhere; 100% Remote', 'company': 'Cascade Financial Technology', 'date': ' 4 days ago', 'Link': 'https://remote.co/job/front-end-android-developer-22/'}, {'title': 'Senior Backend Engineer – Python at Doist', 'job_name': 'Senior Backend Engineer (Python)', 'company': 'Doist', 'date': ' 5 days ago', 'Link': 'https://remote.co/job/senior-backend-engineer-python-21/'}, {'title': "Front End Developer at Brad's Deals", 'job_name': 'Front End Developer', 'company': "Brad's Deals", 'date': ' 5 days ago', 'Link': 'https://remote.co/job/front-end-developer-21-2/'}, {'title': 'Director of Engineering at Farmgirl Flowers', 'job_name': 'Director of Engineering', 'company': 'Farmgirl Flowers', 'date': ' 5 days ago', 'Link': 'https://remote.co/job/director-of-engineering-21/'}, {'title': 'Software Engineer, Backend Identity at Affirm', 'job_name': 'Title: Software Engineer, Backend (Identity)', 'company': 'Affirm', 'date': ' 5 days ago', 'Link': 'https://remote.co/job/software-engineer-backend-identity-21/'}, {'title': 'Backend Developer (Node/Typescript) at CitizenShipper', 'job_name': 'Location:\xa0 International, Anywhere; 100% Remote', 'company': 'CitizenShipper', 'date': ' 6 days ago', 'Link': 'https://remote.co/job/backend-developer-node-typescript-20/'}, {'title': 'Fullstack Developer (TypeScript) at CitizenShipper', 'job_name': 'Location:\xa0 International, Anywhere; 100% Remote', 
'company': 'CitizenShipper', 'date': ' 6 days ago', 'Link': 'https://remote.co/job/fullstack-developer-typescript-20/'}, {'title': 'Senior Software Engineer- Java at Method, Inc.', 'job_name': 'Location:\xa0 US Locations; 100% Remote', 'company': 'Method, Inc.', 'date': ' 6 days ago', 'Link': 'https://remote.co/job/senior-software-engineer-java-2/'}, {'title': 'Senior Software Engineer – Backend at Varsity Tutors', 'job_name': 'Title:\xa0Senior Software Engineer (Backend) – Golang', 'company': 'Varsity
Tutors', 'date': ' 6 days ago', 'Link': 'https://remote.co/job/senior-software-engineer-backend-20/'}, {'title': 'Backend Engineer, Growth Engineering at Stripe, Inc.', 'job_name': 'Backend Engineer, Growth Engineering', 'company':
'Stripe, Inc.', 'date': ' 6 days ago', 'Link': 'https://remote.co/job/backend-engineer-growth-engineering-20/'}, {'title': 'Game Developer at Voodoo', 'job_name': 'Game Developer', 'company': 'Voodoo', 'date': ' 6 days ago', 'Link': 'https://remote.co/job/game-developer-20/'}, {'title': 'Senior Ruby Engineer at Clearcover', 'job_name': 'Title: Sr. Ruby Engineer', 'company': 'Clearcover', 'date': ' 1 week ago', 'Link': 'https://remote.co/job/senior-ruby-engineer-18/'}, {'title': 'Ruby Engineer at Clearcover', 'job_name': 'Title: Ruby Engineer', 'company': 'Clearcover', 'date': ' 1 week ago', 'Link': 'https://remote.co/job/ruby-engineer-17/'}, {'title': 'DevOps Engineer at OCCRP', 'job_name': 'Location:\xa0 International, Anywhere; Freelance', 'company': 'OCCRP', 'date': ' 1 week ago', 'Link': 'https://remote.co/job/devops-engineer-65/'}, {'title': 'Python Developer at ScienceLogic', 'job_name': 'Title:\xa0Python Developer', 'company': 'ScienceLogic', 'date': ' 1 week ago', 'Link': 'https://remote.co/job/python-developer-16/'}, {'title': 'Senior Software Engineer – App Stores Backend at Canonical', 'job_name': 'Title:\xa0Senior Software Engineer – App Stores Backend (Remote)', 'company': 'Canonical', 'date': ' 1 week ago', 'Link': 'https://remote.co/job/senior-software-engineer-app-stores-backend-16/'}, {'title': 'Software Engineer, Backend – Machine Learning Platform at
Affirm', 'job_name': 'Software Engineer, Backend (Machine Learning Platform)', 'company': 'Affirm', 'date': ' 2 weeks ago', 'Link': 'https://remote.co/job/software-engineer-backend-machine-learning-platform-14/'}, {'title': 'Senior
Engineering Manager, Billing at Webflow', 'job_name': 'Title: Senior Engineering Manager, Billing', 'company': 'Webflow', 'date': ' 2 weeks ago', 'Link': 'https://remote.co/job/senior-engineering-manager-billing-14/'}, {'title': 'Senior Software Engineer, Anti-Tracking at Mozilla', 'job_name': 'Title: Senior Software Engineer, Anti-Tracking', 'company': 'Mozilla', 'date': ' 2 weeks ago', 'Link': 'https://remote.co/job/senior-software-engineer-anti-tracking-14/'}, {'title': 'Director of Engineering at Conserv', 'job_name': 'Location:\xa0 International, Anywhere; 100% Remote', 'company': 'Conserv', 'date': ' 2 weeks ago', 'Link': 'https://remote.co/job/director-of-engineering-14/'}, {'title': 'Lead Front End Developer- Email at Stitch Fix', 'job_name': 'Title:\xa0Lead Front End Developer- Email', 'company': 'Stitch Fix', 'date': ' 2 weeks ago', 'Link': 'https://remote.co/job/lead-front-end-developer-email-13/'}, {'title': 'Technical Lead Growth Monetization, Frontend at HubSpot', 'job_name': 'Technical Lead Growth Monetization, Frontend (US/Remote)', 'company': 'HubSpot', 'date': ' 2 weeks ago', 'Link': 'https://remote.co/job/technical-lead-growth-monetization-frontend-11/'}, {'title': 'Senior Software Engineer, Backend Debit+ at Affirm', 'job_name': 'Title:\xa0Senior Software Engineer, Backend\xa0(Debit+)', 'company': 'Affirm', 'date': ' 2 weeks ago', 'Link': 'https://remote.co/job/senior-software-engineer-backend-debit-11/'}, {'title': 'C++ Graphics and Windowing System Software Engineer at Canonical', 'job_name': 'Title:\xa0C++ Graphics and Windowing System Software Engineer\xa0– Mir', 'company': 'Canonical', 'date': ' 2 weeks ago', 'Link': 'https://remote.co/job/c-graphics-and-windowing-system-software-engineer-9/'}, {'title': 'Senior Manager, Software Engineering at Myriad Genetics', 'job_name': 'Title:\xa0Senior Manager, Software Engineering', 'company': 'Myriad Genetics', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-manager-software-engineering-8/'}, {'title': 'Senior Kernel Build Automation Engineer at Canonical', 'job_name': 'Title: Senior Kernel Build Automation Engineer ', 'company': 'Canonical', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-kernel-build-automation-engineer-8/'}, {'title': 'Engineering Manager – Full Stack at Betterment', 'job_name': 'Title: Engineering Manager – Full Stack', 'company': 'Betterment', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/engineering-manager-full-stack-7/'}, {'title': 'Principal Architect – Software Engineering at Citizens Bank', 'job_name': 'Principal Architect – Software Engineering', 'company': 'Citizens Bank', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/principal-architect-software-engineering-7/'}, {'title': 'Senior Software Engineer, Kubernetes Platform at Appboy', 'job_name': 'Title:\xa0Senior Software Engineer, Kubernetes Platform', 'company': 'Appboy', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-software-engineer-kubernetes-platform-7/'}, {'title': 'Senior React Native Developer at Toptal', 'job_name': 'Location:\xa0 International, Anywhere; 100% Remote; Freelance', 'company': 'Toptal', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-react-native-developer-11/'}, {'title': 'Senior Blockchain Developer at Toptal', 'job_name': 'Location: International, Anywhere; 100% Remote; Freelance', 'company': 'Toptal', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-blockchain-developer-5/'}, {'title': 'Front-End Developer at Toptal', 'job_name': 
'Location: International, Anywhere; 100% Remote; Freelance', 'company': 'Toptal', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/front-end-developer-5-2/'}, {'title': 'Senior DevOps Engineer at Toptal', 'job_name': 'Location:\xa0 International, Anywhere; 100% Remote; Freelance', 'company': 'Toptal', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-devops-engineer-11-2/'}, {'title': 'Senior React Developer at Toptal', 'job_name': 'Location: Anywhere, International;\xa0 Freelance;\xa0 100% Remote', 'company': 'Toptal', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-react-developer-5/'}, {'title': 'Full-Stack Developer at Toptal', 'job_name': 'Location: International, Anywhere; 100% Remote; Freelance', 'company': 'Toptal', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/full-stack-developer-5-2/'}, {'title': 'Senior Full Stack Developer: Long-term job – 100% remote at Proxify AB', 'job_name': 'Location:\xa0 International, Anywhere; 100% Remote; Freelance', 'company': 'Proxify AB', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-full-stack-developer-long-term-job-100-remote-6/'}, {'title': 'Software Engineer – Backend at 0x', 'job_name': 'Software Engineer – Backend (Campus)', 'company': '0x', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/software-engineer-backend-5-2/'}, {'title': 'Engineering Manager at Array.com', 'job_name': 'Engineering Manager', 'company': 'Array.com', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/engineering-manager-5-2/'}, {'title': 'Senior Software Engineer, Canvas Facilitation at MURAL.co', 'job_name': 'Senior Software Engineer, Canvas Facilitation', 'company': 'MURAL.co', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/senior-software-engineer-canvas-facilitation-5/'}, {'title': 'Backend Engineer at CareRev', 'job_name': 'Title:\xa0Backend Engineer', 'company': 'CareRev', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/backend-engineer-5-2/'}, {'title': 'Principal Software Engineer, Architect Cognitive Automation at Appian', 'job_name': 'Title:\xa0Principal Software Engineer/Architect (Cognitive Automation)', 'company': 'Appian', 'date': ' 3 weeks ago', 'Link': 'https://remote.co/job/principal-software-engineer-architect-cognitive-automation-5/'}]
Your lists contain a bit of unnecessary data at the moment. Can you provide an example of how it is supposed to look in the end?
However, you can use zip() to iterate over all the lists at the same time:
jobs = soup.find_all('a', class_='card m-0 border-left-0 border-right-0 border-top-0 border-bottom')
titles = soup.find_all('span', class_='font-weight-bold larger')
dates_added = soup.find_all('span', class_='float-right d-none d-md-inline text-secondary')
companies = soup.find_all('p', class_='m-0 text-secondary')
for job, title, date_added, company in zip(jobs, titles, dates_added, companies):
    print(job, title, date_added, company)
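Keep in mind that zip() stops at the shortest of the lists, so if one selector matches fewer elements than the others, trailing items are silently dropped. Inside the loop you would typically pull out the text and the href instead of printing the raw tags; a sketch using the same selectors:

job_list = []
for job, title, date_added, company in zip(jobs, titles, dates_added, companies):
    job_list.append({
        'link': f"https://remote.co{job['href']}",
        'title': title.text,
        'date_added': date_added.text,
        'company': company.get_text(strip=True),
    })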
I'm attempting to get data from Wikipedia's sidebar on the 'Current Events' page with the code below. At the moment this produces an array of objects, each with the values title and url.
I would also like to give each object in the array a new value, headline, derived from the <h3> id or text content. This would result in each object having three values: headline, url and title. However, I'm unsure how to iterate through these.
Beautiful Soup Code
soup = BeautifulSoup(response, "html.parser").find('div', {'aria-labelledby': 'Ongoing_events'})
links = soup.findAll('a')
for item in links:
    title = item.text
    url = "https://en.wikipedia.org" + item['href']
    eo = CurrentEventsObject(title, url)
    eventsArray.append(eo)
Wikipedia Current Events List
<div class="mw-collapsible-content">
<h3><span class="mw-headline" id="Disasters">Disasters</span></h3>
<ul>
<li>Climate crisis</li>
<li>COVID-19 pandemic</li>
<li>2021–22 European windstorm season</li>
<li>2020–21 H5N8 outbreak</li>
<li>2021 Pacific typhoon season</li>
<li>Madagascar food crisis</li>
<li>Water crisis in Iran</li>
<li>Yemeni famine</li>
<li>2021 La Palma eruption</li>
</ul>
<h3><span class="mw-headline" id="Economic">Economic</span></h3>
<ul>
<li>2020–2021 global chip shortage</li>
<li>2021 global supply chain crisis</li>
<li>COVID-19 recession</li>
<li>Lebanese liquidity crisis</li>
<li>Pandora Papers leak</li>
<li>Sri Lankan economic and food crisis</li>
<li>Turkish currency and debt crisis</li>
<li>United Kingdom natural gas supplier crisis</li>
</ul>
<h3><span class="mw-headline" id="Politics">Politics</span></h3>
<ul>
<li>Belarus−European Union border crisis</li>
<li>Brazilian protests</li>
<li>Colombian tax reform protests</li>
<li>Eswatini protests</li>
<li>Haitian protests</li>
<li>Indian farmers' protests</li>
<li>Insulate Britain protests</li>
<li>Jersey dispute</li>
<li>Libyan peace process</li>
<li>Malaysian political crisis</li>
<li>Myanmar protests</li>
<li>Nicaraguan protests</li>
<li>Nigerian protests</li>
<li>Persian Gulf crisis</li>
<li>Peruvian crisis</li>
<li>Russian election protests</li>
<li>Solomon Islands unrest</li>
<li>Tigrayan peace process</li>
<li>Thai protests</li>
<li>Tunisian political crisis</li>
<li>United States racial unrest</li>
<li>Venezuelan presidential crisis</li>
</ul>
<div class="editlink noprint plainlinks"><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Portal:Current_events/Sidebar&action=edit">edit section</a></div>
</div>
Note: Try to select your elements more specifically, to get all the information in one process. Defining the list outside your loops will avoid overwriting it.
The following steps will create a list of dicts that, for example, could simply be iterated over or turned into a data frame.
#1 Select all <ul> that are direct siblings of an <h3>:
soup.select('h3 + ul')
#2 Select the <h3> and get its text:
e.find_previous_sibling('h3').text.strip()
#3 Select all <a> in the <ul> and iterate the results while creating a list of dicts:
for a in e.select('a'):
    data.append({
        'headline': headline,
        'title': a['title'],
        'url': 'https://en.wikipedia.org' + a['href']
    })
Example
soup = BeautifulSoup(response, "html.parser").find('div', {'aria-labelledby': 'Ongoing_events'})
data = []
for e in soup.select('h3 + ul'):
    headline = e.find_previous_sibling('h3').text.strip()
    for a in e.select('a'):
        data.append({
            'headline': headline,
            'title': a['title'],
            'url': 'https://en.wikipedia.org' + a['href']
        })
data
Output
[{'headline': 'Disasters',
'title': 'Climate crisis',
'url': 'https://en.wikipedia.org/wiki/Climate_crisis'},
{'headline': 'Disasters',
'title': 'COVID-19 pandemic',
'url': 'https://en.wikipedia.org/wiki/COVID-19_pandemic'},
{'headline': 'Disasters',
'title': '2021–22 European windstorm season',
'url': 'https://en.wikipedia.org/wiki/2021%E2%80%9322_European_windstorm_season'},
{'headline': 'Disasters',
'title': '2020–21 H5N8 outbreak',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%9321_H5N8_outbreak'},
{'headline': 'Disasters',
'title': '2021 Pacific typhoon season',
'url': 'https://en.wikipedia.org/wiki/2021_Pacific_typhoon_season'},
{'headline': 'Disasters',
'title': '2021 Madagascar food crisis',
'url': 'https://en.wikipedia.org/wiki/2021_Madagascar_food_crisis'},
{'headline': 'Disasters',
'title': 'Water scarcity in Iran',
'url': 'https://en.wikipedia.org/wiki/Water_scarcity_in_Iran'},
{'headline': 'Disasters',
'title': 'Famine in Yemen (2016–present)',
'url': 'https://en.wikipedia.org/wiki/Famine_in_Yemen_(2016%E2%80%93present)'},
{'headline': 'Disasters',
'title': '2021 Cumbre Vieja volcanic eruption',
'url': 'https://en.wikipedia.org/wiki/2021_Cumbre_Vieja_volcanic_eruption'},
{'headline': 'Economic',
'title': '2020–2021 global chip shortage',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%932021_global_chip_shortage'},
{'headline': 'Economic',
'title': '2021 global supply chain crisis',
'url': 'https://en.wikipedia.org/wiki/2021_global_supply_chain_crisis'},
{'headline': 'Economic',
'title': 'COVID-19 recession',
'url': 'https://en.wikipedia.org/wiki/COVID-19_recession'},
{'headline': 'Economic',
'title': 'Lebanese liquidity crisis',
'url': 'https://en.wikipedia.org/wiki/Lebanese_liquidity_crisis'},
{'headline': 'Economic',
'title': 'Pandora Papers',
'url': 'https://en.wikipedia.org/wiki/Pandora_Papers'},
{'headline': 'Economic',
'title': '2021 Sri Lankan economic crisis',
'url': 'https://en.wikipedia.org/wiki/2021_Sri_Lankan_economic_crisis'},
{'headline': 'Economic',
'title': '2018–2021 Turkish currency and debt crisis',
'url': 'https://en.wikipedia.org/wiki/2018%E2%80%932021_Turkish_currency_and_debt_crisis'},
{'headline': 'Economic',
'title': '2021 United Kingdom natural gas supplier crisis',
'url': 'https://en.wikipedia.org/wiki/2021_United_Kingdom_natural_gas_supplier_crisis'},
{'headline': 'Politics',
'title': '2021 Belarus–European Union border crisis',
'url': 'https://en.wikipedia.org/wiki/2021_Belarus%E2%80%93European_Union_border_crisis'},
{'headline': 'Politics',
'title': '2021 Brazilian protests',
'url': 'https://en.wikipedia.org/wiki/2021_Brazilian_protests'},
{'headline': 'Politics',
'title': '2021 Colombian protests',
'url': 'https://en.wikipedia.org/wiki/2021_Colombian_protests'},
{'headline': 'Politics',
'title': '2021 Eswatini protests',
'url': 'https://en.wikipedia.org/wiki/2021_Eswatini_protests'},
{'headline': 'Politics',
'title': '2018–2021 Haitian protests',
'url': 'https://en.wikipedia.org/wiki/2018%E2%80%932021_Haitian_protests'},
{'headline': 'Politics',
'title': "2020–2021 Indian farmers' protest",
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%932021_Indian_farmers%27_protest'},
{'headline': 'Politics',
'title': 'Insulate Britain protests',
'url': 'https://en.wikipedia.org/wiki/Insulate_Britain_protests'},
{'headline': 'Politics',
'title': '2021 Jersey dispute',
'url': 'https://en.wikipedia.org/wiki/2021_Jersey_dispute'},
{'headline': 'Politics',
'title': 'Libyan peace process',
'url': 'https://en.wikipedia.org/wiki/Libyan_peace_process'},
{'headline': 'Politics',
'title': '2020–21 Malaysian political crisis',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%9321_Malaysian_political_crisis'},
{'headline': 'Politics',
'title': '2021 Myanmar protests',
'url': 'https://en.wikipedia.org/wiki/2021_Myanmar_protests'},
{'headline': 'Politics',
'title': '2018–2021 Nicaraguan protests',
'url': 'https://en.wikipedia.org/wiki/2018%E2%80%932021_Nicaraguan_protests'},
{'headline': 'Politics',
'title': 'End SARS',
'url': 'https://en.wikipedia.org/wiki/End_SARS'},
{'headline': 'Politics',
'title': '2019–2021 Persian Gulf crisis',
'url': 'https://en.wikipedia.org/wiki/2019%E2%80%932021_Persian_Gulf_crisis'},
{'headline': 'Politics',
'title': '2017–present Peruvian political crisis',
'url': 'https://en.wikipedia.org/wiki/2017%E2%80%93present_Peruvian_political_crisis'},
{'headline': 'Politics',
'title': '2021 Russian election protests',
'url': 'https://en.wikipedia.org/wiki/2021_Russian_election_protests'},
{'headline': 'Politics',
'title': '2021 Solomon Islands unrest',
'url': 'https://en.wikipedia.org/wiki/2021_Solomon_Islands_unrest'},
{'headline': 'Politics',
'title': 'Tigrayan peace process',
'url': 'https://en.wikipedia.org/wiki/Tigrayan_peace_process'},
{'headline': 'Politics',
'title': '2020–2021 Thai protests',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%932021_Thai_protests'},
{'headline': 'Politics',
'title': '2021 Tunisian political crisis',
'url': 'https://en.wikipedia.org/wiki/2021_Tunisian_political_crisis'},
{'headline': 'Politics',
'title': '2020–2021 United States racial unrest',
'url': 'https://en.wikipedia.org/wiki/2020%E2%80%932021_United_States_racial_unrest'},
{'headline': 'Politics',
'title': 'Venezuelan presidential crisis',
'url': 'https://en.wikipedia.org/wiki/Venezuelan_presidential_crisis'}]
I am trying to use the scrape_linkedin package. I followed the section on the GitHub page on how to set up the package and the LinkedIn li_at key (which I paste here for clarity).
Getting LI_AT
Navigate to www.linkedin.com and log in
Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
Find and copy the li_at value
Once I collect the li_at value from my LinkedIn, I run the following code:
from scrape_linkedin import ProfileScraper

with ProfileScraper(cookie='myVeryLong_li_at_Code_which_has_characters_like_AQEDAQNZwYQAC5_etc') as scraper:
    profile = scraper.scrape(url='https://www.linkedin.com/in/justintrudeau/')
print(profile.to_dict())
I have two questions (I am originally an R user).
How can I input a list of profiles:
https://www.linkedin.com/in/justintrudeau/
https://www.linkedin.com/in/barackobama/
https://www.linkedin.com/in/williamhgates/
https://www.linkedin.com/in/wozniaksteve/
and scrape the profiles? (In R I would use the map function from the purrr package to apply the function to each of the LinkedIn profiles).
The output (from the original GitHub page) is returned in a JSON-style format. My second question is how I can convert this into a pandas data frame (i.e. it is returned similar to the following).
{'personal_info': {'name': 'Steve Wozniak', 'headline': 'Fellow at Apple', 'company': None, 'school': None, 'location': 'San Francisco Bay Area', 'summary': '', 'image': '', 'followers': '', 'email': None, 'phone': None, 'connected': None, 'websites': [], 'current_company_link': 'https://www.linkedin.com/company/sandisk/'}, 'experiences': {'jobs': [{'title': 'Chief Scientist', 'company': 'Fusion-io', 'date_range': 'Jul 2014 – Present', 'location': 'Primary Data', 'description': "I'm looking into future technologies applicable to servers and storage, and helping this company, which I love, get noticed and get a lead so that the world can discover the new amazing technology they have developed. My role is principally a marketing one at present but that will change over time.", 'li_company_url': 'https://www.linkedin.com/company/sandisk/'}, {'title': 'Fellow', 'company': 'Apple', 'date_range': 'Mar 1976 – Present', 'location': '1 Infinite Loop, Cupertino, CA 94015', 'description': 'Digital Design engineer.', 'li_company_url': ''}, {'title': 'President & CTO', 'company': 'Wheels of Zeus', 'date_range': '2002 – 2005', 'location': None, 'description': None, 'li_company_url': 'https://www.linkedin.com/company/wheels-of-zeus/'}, {'title': 'diagnostic programmer', 'company': 'TENET Inc.', 'date_range': '1970 – 1971', 'location': None, 'description': None, 'li_company_url': ''}], 'education': [{'name': 'University of California, Berkeley', 'degree': 'BS', 'grades': None, 'field_of_study': 'EE & CS', 'date_range': '1971 – 1986', 'activities': None}, {'name': 'University of Colorado Boulder', 'degree': 'Honorary PhD.', 'grades': None, 'field_of_study': 'Electrical and Electronics Engineering', 'date_range': '1968 – 1969', 'activities': None}], 'volunteering': []}, 'skills': [], 'accomplishments': {'publications': [], 'certifications': [], 'patents': [], 'courses': [], 'projects': [], 'honors': [], 'test_scores': [], 'languages': [], 'organizations': []}, 'interests': ['Western Digital', 'University of Colorado Boulder', 'Western Digital Data Center Solutions', 'NEW Homebrew Computer Club', 'Wheels of Zeus', 'SanDisk®']}
Firstly, you can create a custom function to scrape the data and use Python's map function to apply it to each profile link.
Secondly, to create a pandas dataframe from a dictionary, you can simply pass the dictionary to pd.DataFrame.
Thus, to create a dataframe df from a dictionary d (avoid naming it dict, which shadows the built-in), you can do:
df = pd.DataFrame(d)
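A minimal sketch of both points, assuming the ProfileScraper usage from the question and that each profile dictionary has the nested shape shown above. pd.json_normalize is used here to flatten the nested keys into columns; plain pd.DataFrame also works for flat dictionaries:

import pandas as pd
from scrape_linkedin import ProfileScraper

profile_urls = [
    'https://www.linkedin.com/in/justintrudeau/',
    'https://www.linkedin.com/in/barackobama/',
    'https://www.linkedin.com/in/williamhgates/',
    'https://www.linkedin.com/in/wozniaksteve/',
]

with ProfileScraper(cookie='your_li_at_value') as scraper:
    # map applies the scrape to every URL, much like purrr::map in R
    profiles = list(map(lambda url: scraper.scrape(url=url).to_dict(), profile_urls))

# flattens e.g. personal_info.name into its own column
df = pd.json_normalize(profiles)
print(df.head())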
I'm new to the concept of generators and I'm struggling with how to apply my changes to the records within the generator object returned by the RISparser module.
I understand that a generator only reads one record at a time and doesn't actually store the data in memory, but I'm having a tough time iterating over it effectively and applying my changes.
My changes involve dropping records whose ['doi'] values are not contained in a list of DOIs (doi_match).
doi_match = ['10.1002/14651858.CD008259.pub2','10.1002/14651858.CD011552','10.1002/14651858.CD011990']
The generator object returned from RISparser contains the following information; this is just the first 2 records of a few hundred returned. I want to iterate over it and compare the 'doi' key of each record with the list of DOIs.
{'type_of_reference': 'JOUR', 'title': "The CoRe Outcomes in WomeN's health (CROWN) initiative: Journal editors invite researchers to develop core outcomes in women's health", 'secondary_title': 'Neurourology and Urodynamics', 'alternate_title1': 'Neurourol. Urodyn.', 'volume': '33', 'number': '8', 'start_page': '1176', 'end_page': '1177', 'year': '2014', 'doi': '10.1002/nau.22674', 'issn': '07332467 (ISSN)', 'authors': ['Khan, K.'], 'keywords': ['Bias (epidemiology)', 'Clinical trials', 'Consensus', 'Endpoint determination/standards', 'Evidence-based medicine', 'Guidelines', 'Research design/standards', 'Systematic reviews', 'Treatment outcome', 'consensus', 'editor', 'female', 'human', 'medical literature', 'Note', 'outcomes research', 'peer review', 'randomized controlled trial (topic)', 'systematic review (topic)', "women's health", 'outcome assessment', 'personnel', 'publication', 'Female', 'Humans', 'Outcome Assessment (Health Care)', 'Periodicals as Topic', 'Research Personnel', "Women's Health"], 'publisher': 'John Wiley and Sons Inc.', 'notes': ['Export Date: 14 July 2020', 'CODEN: NEURE'], 'type_of_work': 'Note', 'name_of_database': 'Scopus', 'custom2': '25270392', 'language': 'English', 'url': 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908368202&doi=10.1002%2fnau.22674&partnerID=40&md5=b220702e005430b637ef9d80a94dadc4'}
{'type_of_reference': 'JOUR', 'title': "The CROWN initiative: Journal editors invite researchers to develop core outcomes in women's health", 'secondary_title': 'Gynecologic Oncology', 'alternate_title1': 'Gynecol. Oncol.', 'volume': '134', 'number': '3', 'start_page': '443', 'end_page': '444', 'year': '2014', 'doi': '10.1016/j.ygyno.2014.05.005', 'issn': '00908258 (ISSN)', 'authors': ['Karlan, B.Y.'], 'author_address': 'Gynecologic Oncology and Gynecologic Oncology Reports, India', 'keywords': ['clinical trial (topic)', 'decision making', 'Editorial', 'evidence based practice', 'female infertility', 'health care personnel', 'human', 'outcome assessment', 'outcomes research', 'peer review', 'practice guideline', 'premature labor', 'priority journal', 'publication', 'systematic review (topic)', "women's health", 'editorial', 'female', 'outcome assessment', 'personnel', 'publication', 'Female', 'Humans', 'Outcome Assessment (Health Care)', 'Periodicals as Topic', 'Research Personnel', "Women's Health"], 'publisher': 'Academic Press Inc.', 'notes': ['Export Date: 14 July 2020', 'CODEN: GYNOA', 'Correspondence Address: Karlan, B.Y.; Gynecologic Oncology and Gynecologic Oncology ReportsIndia'], 'type_of_work': 'Editorial', 'name_of_database': 'Scopus', 'custom2': '25199578', 'language': 'English', 'url': 'https://www.scopus.com/inward/record.uri?eid=2-s2.0-84908351159&doi=10.1016%2fj.ygyno.2014.05.005&partnerID=40&md5=ab5a4d26d52c12d081e38364b0c79678'}
I tried iterating over the generator and applying the changes, but the records that have matches are not being placed in the match list.
match = []
for entry in ris_records:
    if entry['doi'] in doi_match:
        match.append(entry)
    else:
        del entry
Any advice on how to iterate over a generator correctly? Thanks.
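For reference, the del entry branch only removes the local loop variable, not the record, so it has no effect. A compact way to express the filtering is a comprehension over the generator; the .get() default and .strip() here are assumptions to guard against records without a 'doi' key or with stray whitespace:

# a generator can be consumed only once, so build the filtered list in a single pass
match = [entry for entry in ris_records
         if entry.get('doi', '').strip() in doi_match]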
I am scraping the website nykaa.com, and the link is (https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397?root=nav_3&page_no=1). There are 25 pages and the data loads dynamically per page. I am unable to find the source of the data. Moreover, when I scrape the data I am only able to get 20 products, which are then repeated until the list grows to 420 products.
import requests
from bs4 import BeautifulSoup
import unicodecsv as csv

urls = []
l1 = []
for page in range(1, 5):
    result = requests.get("https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397?root=nav_3&page_no=" + str(page))
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    for div_tag in soup.find_all("div", class_="card-wrapper-container col-xs-12 col-sm-6 col-md-4"):
        for div1_tag in soup.find_all("div", class_="product-list-box card desktop-cart"):
            h2_tag = div1_tag.find("h2").find("span")
            price_tag = div1_tag.find("div", class_="price-info")
            l1 = [h2_tag.get_text(), price_tag.get_text()]
            urls.append(l1)
#print(urls)
with open('xyz.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    wr.writerows(urls)
The above code fetches me a list of around 1200 product names and prices, of which only 30 to 40 are unique; the rest are duplicates. I want to fetch the data from all 25 pages uniquely, and there are 486 unique products in total. I also used Selenium to click the next-page link, but that didn't work out either.
This makes the request the page itself makes (as viewed in the network tab) in a loop over all pages (including determining the number of pages). results is a list of lists that you can easily write to csv.
import requests, math, csv

page = '1'

def append_new_rows(data):
    for i in data:
        if 'name' in i:
            results.append([i['name'], i['final_price']])

with requests.Session() as s:
    r = s.get(f'https://www.nykaa.com/gludo/products/list?pro=false&filter_format=v2&app_version=null&client=react&root=nav_3&page_no={page}&category_id=8397').json()
    results_per_page = 20
    total_results = r['response']['total_found']
    num_pages = math.ceil(total_results / results_per_page)
    results = []
    append_new_rows(r['response']['products'])
    for page in range(2, num_pages + 1):
        r = s.get(f'https://www.nykaa.com/gludo/products/list?pro=false&filter_format=v2&app_version=null&client=react&root=nav_3&page_no={page}&category_id=8397').json()
        append_new_rows(r['response']['products'])

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Name', 'Price'])
    for row in results:
        w.writerow(row)
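If pandas is already installed, the same CSV could alternatively be written in one line from the results list:

import pandas as pd
pd.DataFrame(results, columns=['Name', 'Price']).to_csv('data.csv', index=False, encoding='utf-8-sig')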
You can use selenium:
from bs4 import BeautifulSoup as soup
from selenium import webdriver

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397')

def get_products(_d):
    return [{'title': (lambda x: x if not x else x.text)(i.find('div', {'class': 'm-content__product-list__title'})),
             'price': (lambda x: x if not x else x.text)(i.find('span', {'class': 'post-card__content-price-offer'}))}
            for i in _d.find_all('div', {'class': 'card-wrapper-container col-xs-12 col-sm-6 col-md-4'})]

s = soup(d.page_source, 'html.parser')
r = [list(filter(None, get_products(s)))]
while 'disable-event' not in s.find('li', {'class': 'next'}).attrs['class']:
    d.get(f"https://www.nykaa.com{s.find('li', {'class': 'next'}).a['href']}")
    s = soup(d.page_source, 'html.parser')
    r.append(list(filter(None, get_products(s))))
Sample output (first three pages):
[[{'title': 'The Face Shop Calendula Essential Moisture Serum', 'price': '₹1320 '}, {'title': 'Palmers Cocoa Butter Formula Skin Perfecting Ultra Hydrating...', 'price': '₹970 '}, {'title': "Cheryl's Cosmeceuticals Clarifi Acne Anti Blemish Serum", 'price': '₹875 '}, {'title': 'Estee Lauder Advanced Night Repair Synchronized Recovery Com...', 'price': '₹1250 '}, {'title': 'Estee Lauder Advanced Night Repair Synchronized Recovery Com...', 'price': '₹1250 '}, {'title': 'Estee Lauder Advanced Night Repair Synchronized Recovery Com...', 'price': '₹3900 '}, {'title': 'Klairs Freshly Juiced Vitamin Drop', 'price': '₹1492 '}, {'title': 'Innisfree The Green Tea Seed Serum', 'price': '₹1950 '}, {'title': "Kiehl's Midnight Recovery Concentrate", 'price': '₹2100 '}, {'title': 'The Face Shop White Seed Brightening Serum', 'price': '₹1990 '}, {'title': 'Biotique Bio Dandelion Visibly Ageless Serum', 'price': '₹230 '}, {'title': None, 'price': None}, {'title': 'St.Botanica Vitamin C 20% + Vitamin E & Hyaluronic Acid Faci...', 'price': '₹1499 '}, {'title': 'Biotique Bio Coconut Whitening & Brightening Cream', 'price': '₹199 '}, {'title': 'Neutrogena Fine Fairness Brightening Serum', 'price': '₹849 '}, {'title': "Kiehl's Clearly Corrective Dark Spot Solution", 'price': '₹4300 '}, {'title': "Kiehl's Clearly Corrective Dark Spot Solution", 'price': '₹4300 '}, {'title': 'Lakme Absolute Perfect Radiance Skin Lightening Serum', 'price': '₹960 '}, {'title': 'St.Botanica Hyaluronic Acid + Vitamin C, E Facial Serum', 'price': '₹1499 '}, {'title': 'Jeva Vitamin C Serum with Hyaluronic Acid for Anti Aging and...', 'price': '₹350 '}, {'title': 'Lotus Professional Phyto-Rx Whitening & Brightening Serum', 'price': '₹595 '}], [{'title': 'The Face Shop Chia Seed Moisture Recharge Serum', 'price': '₹1890 '}, {'title': 'Lotus Herbals WhiteGlow Skin Whitening & Brightening Gel Cre...', 'price': '₹280 '}, {'title': 'Lakme 9 to 5 Naturale Aloe Aqua Gel', 'price': '₹200 '}, {'title': 'Estee Lauder Advanced Night Repair Synchronized Recovery Com...', 'price': '₹5900 '}, {'title': 'Mixify Unloc Skin Glow Serum', 'price': '₹499 '}, {'title': 'St.Botanica Retinol 2.5% + Vitamin E & Hyaluronic Acid Profe...', 'price': '₹1499 '}, {'title': 'LANEIGE Hydration Combo Set', 'price': '₹3000 '}, {'title': 'Biotique Bio Dandelion Ageless Visiblly Serum', 'price': '₹690 '}, {'title': 'The Moms Co. 
Natural Vita Rich Face Serum', 'price': '₹699 '}, {'title': "It's Skin Power 10 Formula VC Effector", 'price': '₹950 '}, {'title': "Kiehl's Powerful-Strength Line-Reducing Concentrate", 'price': '₹5100 '}, {'title': 'Olay Natural White Light Instant Glowing Fairness Skin Cream', 'price': '₹99 '}, {'title': 'Plum Green Tea Skin Clarifying Concentrate', 'price': '₹881 '}, {'title': 'Olay Total Effects 7 In One Anti-Ageing Smoothing Serum', 'price': '₹764 '}, {'title': 'Elizabeth Arden Ceramide Daily Youth Restoring Serum 60 Caps...', 'price': '₹5850 '}, {'title': None, 'price': None}, {'title': 'Olay Regenerist Advanced Anti-Ageing Micro-Sculpting Serum', 'price': '₹1699 '}, {'title': 'Lakme Absolute Argan Oil Radiance Overnight Oil-in-Serum', 'price': '₹945 '}, {'title': 'The Face Shop Mango Seed Silk Moisturizing Emulsion', 'price': '₹1890 '}, {'title': 'The Face Shop Calendula Essential Good to Glow Combo', 'price': '₹2557 '}, {'title': 'Garnier Skin Naturals Light Complete Serum Cream', 'price': '₹69 '}], [{'title': 'Clinique Moisture Surge Hydrating Supercharged Concentrate', 'price': '₹2550 '}, {'title': 'LANEIGE Sleeping Mask Combo', 'price': '₹3000 '}, {'title': 'Klairs Rich Moist Soothing Serum', 'price': '₹1492 '}, {'title': 'Estee Lauder Idealist Pore Minimizing Skin Refinisher', 'price': '₹5500 '}, {'title': 'O3+ Whitening & Brightening Serum', 'price': '₹1475 '}, {'title': 'Elizabeth Arden Ceramide Daily Youth Restoring Serum 90 Caps...', 'price': '₹6900 '}, {'title': 'Olay Natural White Light Instant Glowing Fairness Skin Cream', 'price': '₹189 '}, {'title': "L'Oreal Paris White Perfect Clinical Expert Anti-Spot Whiten...", 'price': '₹1480 '}, {'title': 'belif Travel Kit', 'price': '₹1499 '}, {'title': 'Forest Essentials Advanced Soundarya Serum With 24K Gold', 'price': '₹3975 '}, {'title': "L'Occitane Immortelle Reset Serum", 'price': '₹4500 '}, {'title': 'Lakme Absolute Skin Gloss Reflection Serum 30ml', 'price': '₹990 '}, {'title': 'Neutrogena Hydro Boost Emulsion', 'price': '₹999 '}, {'title': 'Innisfree Anti-Aging Set', 'price': '₹2350 '}, {'title': 'Clinique Fresh Pressed 7-Day System With Pure Vitamin C', 'price': '₹2400 '}, {'title': 'The Face Shop The Therapy Premier Serum', 'price': '₹2490 '}, {'title': 'The Body Shop Vitamin E Overnight Serum In Oil', 'price': '₹1695 '}, {'title': 'Jeva Vitamin C Serum with Hyaluronic Acid for Anti Aging and...', 'price': '₹525 '}, {'title': 'Olay Regenerist Micro Sculpting Cream and White Radiance Hyd...', 'price': '₹2698 '}, {'title': 'The Face Shop Yehwadam Pure Brightening Serum', 'price': '₹4350 '}]]