I am writing a script that scrapes information from a large travel agency. My code closely follows the tutorial at https://python.gotrained.com/selenium-scraping-booking-com/.
However, I would like to be able to navigate to the next page as I'm now limited to n_results = 25. Where do I add this in the code? I know that I can target the pagination button with driver.find_element_by_class_name('paging-next').click(), but I don't know where to incorporate it.
I have tried to put it in the for loop within the scrape_results function, which I have copied below. However, it doesn't seem to work.
def scrape_results(driver, n_results):
    '''Returns the data from n_results amount of results.'''
    accommodations_urls = list()
    accommodations_data = list()

    for accomodation_title in driver.find_elements_by_class_name('sr-hotel__title'):
        accommodations_urls.append(accomodation_title.find_element_by_class_name(
            'hotel_name_link').get_attribute('href'))

    for url in range(0, n_results):
        if url == n_results:
            break
        url_data = scrape_accommodation_data(driver, accommodations_urls[url])
        accommodations_data.append(url_data)

    return accommodations_data
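For reference, a minimal sketch of one way the 'paging-next' click could be incorporated (the helper name is just illustrative): collect the accommodation URLs across several result pages first, then scrape each URL with scrape_accommodation_data as before. This assumes the tutorial's old find_element_by_* Selenium API; the fixed sleep is a crude wait and could be replaced with an explicit wait.

import time

def collect_accommodation_urls(driver, n_pages):
    '''Sketch: gather hotel URLs from up to n_pages of search result pages.'''
    urls = []
    for _ in range(n_pages):
        for title in driver.find_elements_by_class_name('sr-hotel__title'):
            urls.append(title.find_element_by_class_name(
                'hotel_name_link').get_attribute('href'))
        try:
            # 'paging-next' is the class name mentioned in the question
            driver.find_element_by_class_name('paging-next').click()
            time.sleep(2)  # crude wait for the next results page to load
        except Exception:
            break  # no next button: last page reached
    return urls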
EDIT
I have added some more code to clarify my input and output. Again, I mostly just used code from the GoTrained tutorial and added some code of my own. How I understand it: the scraper first collects all URLs and then scrapes the info of the individual pages one by one. I need to add the pagination loop in that first part – I think.
if __name__ == '__main__':
    try:
        driver = prepare_driver(domain)
        fill_form(driver, 'Waterberg, South Africa')  # my search argument
        # 25 is the maximum number of results; a higher value makes the scraper crash due to the pagination problem
        accommodations_data = scrape_results(driver, 25)
        accommodations_data = json.dumps(accommodations_data, indent=4)
        with open('booking_data.json', 'w') as f:
            f.write(accommodations_data)
    finally:
        driver.quit()
Below is the JSON output for one search result.
[
    {
        "name": "Lodge Entabeni Safari Conservancy",
        "score": "8.4",
        "no_reviews": "41",
        "location": "Vosdal Plaas, R520 Marken Road, 0510 Golders Green, South Africa",
        "room_types": [
            "Tented Chalet - Wildside Safari Camp with 1 game drive",
            "Double or Twin Room - Hanglip Mountain Lodge with 1 game drive",
            "Tented Family Room - Wildside Safari Camp with 1 game drive"
        ],
        "room_prices": [
            "\u20ac 480",
            "\u20ac 214",
            "\u20ac 650",
            "\u20ac 290",
            "\u20ac 693"
        ],
        "popular_facilities": [
            "1 swimming pool",
            "Bar",
            "Very Good Breakfast"
        ]
    },
    ...
]
Related
I'm trying to scrape data from this website: myworkdayjobs link
The data I want to collect are the job advertisements and their respective details.
Currently there are 7 active jobs.
On the inspect page I can see the 7 wanted elements all having the same:
li class="css-1q2dra3"
But page.html.xpath() always returns an empty list.
The steps I've taken are:
session = HTMLSession()
url = (
'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
'?locations=91336993fab910af6d6f80c09504c167'
'&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
page.html.render(sleep=1, keep_page=True, scrolldown=1)
cards = page.html.xpath("the_xpath_here")
print(cards)
I've also tried multiple other XPaths, including the one I get when I copy the XPath by right-clicking the element on the inspect page:
//*[@id="mainContent"]/div/div[2]/section/ul/li[1]/div[1]
/html/body/div/div/div/div[3]/div/div/div[2]/section/ul/li[1]/div[1]
//*[@id="mainContent"]/div/div[2]/section/ul/li[1]
Now, the only time I get results for a li element is when I
cards = page.html.xpath('//li')
which returns the li elements at the bottom of the page, but fully ignores the elements that I want...
I'm not an expert on web scraping, but I have made it work with another careers page before. What am I missing? Why can't I access those elements?
=========================================================
Additional information:
The problem that I experience seems to happen after the section element.
When I
cards = page.html.xpath('//*[@id="mainContent"]/div/div[2]/section/*')
print(cards)
[<Element 'p' data-automation-id='jobFoundText' class=('css-12psxof',)>, <Element 'div' data-automation-id='jobJumpToDetailsContainer' class=('css-14l0ax5',)>, <Element 'div' class=('css-19kzrtu',)>]
Why is there no ul element in the list? It's clearly there in the inspect window.
=========================================================
Answer
(Because the answer is in the accepted solution comment)
The page had apparently not fully loaded by the time cards was assigned, and thus the ul was not there yet.
Adding one more second to the render sleep did the trick (sleep=2).
session = HTMLSession()
url = (
'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'
'?locations=91336993fab910af6d6f80c09504c167'
'&jobFamilyGroup=0c40f6bd1d8f10ae43ffaefd46dc7e78'
)
page = session.get(url)
page.html.render(sleep=2, keep_page=True, scrolldown=1)
cards = page.html.xpath("the_xpath_here")
print(cards)
You say you want li elements, but your 3 XPath variants point to a div or a single li. Try the specific XPath you need: '//li[@class="css-1q2dra3"]'
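As a rough combination of both suggestions (a sketch, not tested against the live page): render with the longer sleep, then target the job cards with the specific XPath. The class name is taken from the question and may change whenever the site updates.

from requests_html import HTMLSession

session = HTMLSession()
page = session.get(url)  # url as defined in the question
page.html.render(sleep=2, keep_page=True, scrolldown=1)

# target the job cards directly by their class attribute
cards = page.html.xpath('//li[@class="css-1q2dra3"]')
print(len(cards))  # expected: 7, one per active job posting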
You can try to use their Ajax API to get the JSON data about the jobs. For example:
import requests
api_url = (
"https://nvidia.wd5.myworkdayjobs.com/wday/cxs/nvidia/NVIDIAExternalCareerSite/jobs"
)
payload = {
    "appliedFacets": {
        "jobFamilyGroup": ["0c40f6bd1d8f10ae43ffaefd46dc7e78"],
        "locations": ["91336993fab910af6d6f80c09504c167"],
    },
    "limit": 20,
    "offset": 0,
    "searchText": "",
}
data = requests.post(api_url, json=payload).json()
print(data)
Prints:
{
    "total": 7,
    "jobPostings": [
        {
            "title": "Senior CPU Compiler Engineer",
            "externalPath": "/job/UK-Remote/Senior-CPU-Compiler-Engineer_JR1954638",
            "locationsText": "7 Locations",
            "postedOn": "Posted 18 Days Ago",
            "bulletFields": ["JR1954638"],
        },
        {
            "title": "CPU Compiler Engineer",
            "externalPath": "/job/UK-Remote/CPU-Compiler-Engineer_JR1954640-1",
            "locationsText": "7 Locations",
            "postedOn": "Posted 26 Days Ago",
            "bulletFields": ["JR1954640"],
        },
...and so on.
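A possible follow-up (sketch): iterate over the returned postings and build absolute links from externalPath. The link pattern below is an assumption; verify it against a real posting in the browser.

for job in data["jobPostings"]:
    print(job["title"], "-", job["postedOn"])
    # assumed pattern: career-site base URL + externalPath
    print("https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite" + job["externalPath"])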
I have a string in R that I would like to pass to python in order to compute something and return the result back into R.
I have the following which "works" but not as I would like.
The code below passes a string from R to Python, uses OpenAI to generate the text, and then loads the result back into R.
library(reticulate)
computePythonFunction <- "
def print_openai_response():
    import openai
    openai.api_key = 'ej-powurjf___OpenAI_API_KEY___HGAJjswe' # you will need an API key
    prompt = 'write me a poem about the sea'
    response = openai.Completion.create(engine = 'text-davinci-003', prompt = prompt, max_tokens=1000)
    #response['choices'][0]['text']
    print(response)
"
py_run_string(computePythonFunction)
py$print_openai_response()
library("rjson")
fromJSON(as.character(py$print_openai_response()))
I would like to store the results in R objects. Here is one example output from the Python script:
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": "\n\nThe sea glitters like stars in the night \nWith hues, vibrant and bright\nThe waves flow gentle, serene, and divine \nLike the sun's most gentle shine\n\nAs the sea reaches, so wide, so vast \nAn adventure awaits, and a pleasure, not passed\nWhite sands, with seaweed green \nForms a kingdom of the sea\n\nConnecting different land and tide \nThe sea churns, dancing with the sun's pride\nAs a tempest forms, raging and wild \nThe sea turns, its colors so mild\n\nA raging storm, so wild and deep \nProtecting the creatures that no one can see \nThe sea is a living breathing soul \nA true and untouchable goal \n\nThe sea is a beauty that no one can describe \nAnd it's power, no one can deny \nAn ever-lasting bond, timeless and free \nThe love of the sea, is a love, to keep"
}
],
"created": 1670525403,
"id": "cmpl-6LGG3hDNzeTZ5VFbkyjwfkHH7rDkE",
"model": "text-davinci-003",
"object": "text_completion",
"usage": {
"completion_tokens": 210,
"prompt_tokens": 7,
"total_tokens": 217
}
}
I am interested in the text generated, but I am also interested in the completion_tokens, prompt_tokens and total_tokens.
I thought about saving the Python code as a script, then passing an argument to it, such as:
myPython.py arg1.
How can I return the JSON output from the model to an R object? The only input that varies in the Python code is the prompt variable.
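One possible sketch (not an authoritative answer): instead of printing, have the Python function take the prompt as an argument and return a plain dict. reticulate converts a returned dict into an R named list, so the fields can be read directly without rjson. This assumes the same openai Completion API used above; the function name and key below are placeholders.

def get_openai_response(prompt):
    '''Sketch: return the fields of interest as a dict instead of printing.'''
    import openai
    openai.api_key = 'YOUR_OPENAI_API_KEY'  # placeholder, supply your own key
    response = openai.Completion.create(
        engine='text-davinci-003', prompt=prompt, max_tokens=1000)
    # pick out the generated text and the token counts explicitly
    return {
        'text': response['choices'][0]['text'],
        'completion_tokens': response['usage']['completion_tokens'],
        'prompt_tokens': response['usage']['prompt_tokens'],
        'total_tokens': response['usage']['total_tokens'],
    }

On the R side this could then be called as result <- py$get_openai_response('write me a poem about the sea'), giving a named list with result$text, result$total_tokens and so on.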
I am trying to scrape the special offers on the Steam website using Python and Beautiful Soup. I am trying to scrape data from multiple pages using a for loop. I have attached the Python code below. Any help is really appreciated. Thanks in advance.
import requests
from bs4 import BeautifulSoup

game_lis = set([])

for page in range(0, 4):
    page_url = "https://store.steampowered.com/specials#p=" + str(page) + "&tab=TopSellers"
    #print(page_url)
    steam_games = requests.get(page_url)
    soup = BeautifulSoup(steam_games.text, 'lxml')
    s_game_offers = soup.findAll('a', class_='tab_item')
    print(page_url)
    for game in s_game_offers:
        title = game.find('div', class_='tab_item_name')
        discount = game.find('div', class_='discount_pct')
        game_lis.add(title.text)
        print(title.text + ":" + discount.text)
The page content is loaded from a different URL via JavaScript, so BeautifulSoup doesn't see it. You can use the next example to see how to load the different pages:
import requests
from bs4 import BeautifulSoup
api_url = "https://store.steampowered.com/contenthub/querypaginated/specials/TopSellers/render/"
params = {
    "query": "",
    "start": "0",
    "count": "15",
    "cc": "SK",  # <-- probably change code here
    "l": "english",
    "v": "4",
    "tag": "",
}

for page in range(0, 4):
    params["start"] = 15 * page
    steam_games = requests.get(api_url, params=params)
    soup = BeautifulSoup(steam_games.json()["results_html"], "lxml")
    s_game_offers = soup.findAll("a", class_="tab_item")
    for game in s_game_offers:
        title = game.find("div", class_="tab_item_name")
        discount = game.find("div", class_="discount_pct")
        print(title.text + ":" + discount.text)
    print("-" * 80)
Prints:
F.I.S.T.: Forged In Shadow Torch:-10%
HITMAN 2 - Gold Edition:-85%
NieR:Automata™:-50%
Horizon Zero Dawn™ Complete Edition:-40%
Need for Speed™ Heat:-86%
Middle-earth: Shadow of War Definitive Edition:-80%
Batman: Arkham Collection:-80%
No Man's Sky:-50%
Legion TD 2 - Multiplayer Tower Defense:-20%
NieR Replicant™ ver.1.22474487139...:-35%
Days Gone:-20%
Mortal Kombat 11 Ultimate:-65%
Human: Fall Flat:-66%
Muse Dash - Just as planned:-30%
The Elder Scrolls Online - Blackwood:-50%
--------------------------------------------------------------------------------
The Elder Scrolls Online - Blackwood:-50%
Football Manager 2022:-10%
Age of Empires II: Definitive Edition:-33%
OCTOPATH TRAVELER™:-50%
DRAGON QUEST® XI S: Echoes of an Elusive Age™ - Definitive Edition:-35%
Witch It:-70%
Monster Hunter: World:-34%
NARUTO SHIPPUDEN: Ultimate Ninja STORM 4:-77%
MADNESS: Project Nexus:-10%
Mad Max:-75%
Outer Wilds:-40%
Middle-earth: Shadow of Mordor Game of the Year Edition:-75%
Age of Empires III: Definitive Edition:-33%
Ghostrunner:-60%
The Elder Scrolls® Online:-60%
--------------------------------------------------------------------------------
...
I am working on a Python script using selenium chromedriver to scrape all google search results (link, header, text) off a specified number of results pages.
The code I have seems to only be scraping the first result from all pages after the first page.
I think this has something to do with how my for-loop is set up in the scrape function, but I have not been able to tweak it into working the way I'd like. Any suggestions for how to fix this, or a better approach, are appreciated.
# create instance of webdriver
driver = webdriver.Chrome()
url = 'https://www.google.com'
driver.get(url)
# set keyword
keyword = 'cars'
# we find the search bar using its name attribute value
searchBar = driver.find_element_by_name('q')
# first we send our keyword to the search bar followed by the enter key
searchBar.send_keys(keyword)
searchBar.send_keys('\n')
def scrape():
    pageInfo = []
    try:
        # wait for search results to be fetched
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "g"))
        )
    except Exception as e:
        print(e)
        driver.quit()
    # contains the search results
    searchResults = driver.find_elements_by_class_name('g')
    for result in searchResults:
        element = result.find_element_by_css_selector('a')
        link = element.get_attribute('href')
        header = result.find_element_by_css_selector('h3').text
        text = result.find_element_by_class_name('IsZvec').text
        pageInfo.append({
            'header' : header, 'link' : link, 'text': text
        })
        return pageInfo
# Number of pages to scrape
numPages = 5
# All the scraped data
infoAll = []
# Scraped data from page 1
infoAll.extend(scrape())
for i in range(0 , numPages - 1):
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())
print(infoAll)
You have an indentation problem:
You should have return pageInfo outside the for loop, otherwise you're returning results after the first loop iteration:
for result in searchResults:
    element = result.find_element_by_css_selector('a')
    link = element.get_attribute('href')
    header = result.find_element_by_css_selector('h3').text
    text = result.find_element_by_class_name('IsZvec').text
    pageInfo.append({
        'header' : header, 'link' : link, 'text': text
    })
    return pageInfo
Like this:
for result in searchResults:
    element = result.find_element_by_css_selector('a')
    link = element.get_attribute('href')
    header = result.find_element_by_css_selector('h3').text
    text = result.find_element_by_class_name('IsZvec').text
    pageInfo.append({
        'header' : header, 'link' : link, 'text': text
    })
return pageInfo
I've run your code and got results:
[{'header': 'Cars (film) — Wikipédia', 'link': 'https://fr.wikipedia.org/wiki/Cars_(film)', 'text': "Cars : Quatre Roues, ou Les Bagnoles au Québec (Cars), est le septième long-métrage d'animation entièrement en images de synthèse des studios Pixar.\nPays d’origine : États-Unis\nDurée : 116 minutes\nSociétés de production : Pixar Animation Studios\nGenre : Animation\nCars 2 · Michel Fortin · Flash McQueen"}, {'header': 'Cars - Wikipedia, la enciclopedia libre', 'link': 'https://es.wikipedia.org/wiki/Cars', 'text': 'Cars es una película de animación por computadora de 2006, producida por Pixar Animation Studios y lanzada por Walt Disney Studios Motion Pictures.\nAño : 2006\nGénero : Animación; Aventuras; Comedia; Infa...\nHistoria : John Lasseter Joe Ranft Jorgen Klubi...\nProductora : Walt Disney Pictures; Pixar Animat...'}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Flash_McQueen', 'text': ''}, {'header': '', 'link': 'https://www.allocine.fr/film/fichefilm-55774/secrets-tournage/', 'text': ''}, {'header': '', 'link': 'https://fr.wikipedia.org/wiki/Martin_(Cars)', 'text': ''},
Suggestions:
Use a timer to control your for loop, otherwise you could be banned by Google due to suspicious activity
Steps:
1.- Import sleep from time: from time import sleep
2.- On your last loop add a timer:
for i in range(0, numPages - 1):
    sleep(5)  # It'll wait 5 seconds for each iteration
    nextButton = driver.find_element_by_link_text('Next')
    nextButton.click()
    infoAll.extend(scrape())
Google Search can be parsed with the BeautifulSoup web scraping library without selenium, since the data is not loaded dynamically via JavaScript, and it will execute much faster than selenium as there's no need to render the page in a browser.
To get information from all pages, you can paginate with an infinite while loop. Try to avoid for i in range() pagination, as it is a hardcoded way of doing pagination and thus not reliable: if the number of pages changes (from 5 to 20), the pagination breaks.
Since the while loop is infinite, you need to set conditions for exiting it. You can use two conditions:
the exit condition is the presence of a button to switch to the next page (it is absent on the last page); the presence can be checked by its CSS selector (in our case - ".d6cvqb a[id=pnnext]")
# condition for exiting the loop in the absence of the next page button
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break
another solution would be to add a limit of pages available for scraping if there is no need to extract all the pages.
# condition for exiting the loop when the page limit is reached
if page_num == page_limit:
    break
When you request a site, it may decide you are a bot. To prevent this, send headers that contain a user-agent with the request; the site will then assume that you are a user and display the information.
A next step could be to rotate the user-agent, for example to switch between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge and so on. The most reliable way is to use rotating proxies, user-agents, and a captcha solver.
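For illustration, a minimal sketch of user-agent rotation with requests (the pool of strings here is just an example, not an exhaustive or current list):

import random
import requests

# small illustrative pool of user-agent strings; extend or refresh as needed
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0",
]

# pick a different user-agent for each request
headers = {"User-Agent": random.choice(user_agents)}
html = requests.get("https://www.google.com/search", params={"q": "cars"}, headers=headers, timeout=30)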
Check full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "cars",  # query example
    "hl": "en",   # language
    "gl": "uk",   # country of the search, UK -> United Kingdom
    "start": 0,   # number page by default up to 0
    #"num": 100   # parameter defines the maximum number of results to return.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10  # page limit for example
page_num = 0

data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except:
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # condition for exiting the loop when the page limit is reached
    if page_num == page_limit:
        break

    # condition for exiting the loop in the absence of the next page button
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Cars (2006) - IMDb",
"snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
"links": "https://www.imdb.com/title/tt0317219/"
},
{
"title": "Cars (film) - Wikipedia",
"snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
"links": "https://en.wikipedia.org/wiki/Cars_(film)"
},
{
"title": "Cars - Rotten Tomatoes",
"snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
"links": "https://www.rottentomatoes.com/m/cars"
},
other results ...
]
You can also use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there is no need to create and maintain the parser.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": "...",    # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google",  # serpapi parser engine
    "q": "cars",         # search query
    "gl": "uk",          # country of the search, UK -> United Kingdom
    "num": "100"         # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

page_limit = 10
organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    if page_num == page_limit:
        break

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"title": "Rally Cars - Page 30 - Google Books result",
"snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
"link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
},
{
"title": "Independent Sports Cars - Page 5 - Google Books result",
"snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
"link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
}
other results...
]
I'm using scrapy to scrape reviews from seek.com.au. I have found this link https://company-profiles-api.cloud.seek.com.au/v1/companies/432306/reviews?page=1 which has data I need encoded in JSON.
The data looks like this:
{
"paging":{
"page":1,
"perPage":20,
"total":825
},
"data":[
{
"timeAgoText":null,
"id":5330561,
"companyName":"Commonwealth Bank of Australia",
"companyRecommended":false,
"salarySummary":"fair",
"salarySummaryDisplayText":"Average",
"jobTitle":"Financial Planning Role",
"title":"Run away, don't walk!",
"pros":"Staff benefits, the programs are very good however IT support is atrocious. There is a procedure for absolutely everything so you aren't left wondering how to do things in the branch network.",
"cons":"Sell, sell, sell! Everything at CBA is about selling. Don't believe the reports that things have changed and performance is based on customer service. They may have on paper but sales numbers are still tracked.",
"yearLeft":"left_2019",
"yearLeftEmploymentStatusText":"former employee",
"yearsWorkedWith":"1_2_years",
"yearsWorkedWithText":"1 to 2 years",
"workLocation":"New South Wales, Australia",
"ratingCompanyOverall":2,
"ratingBenefitsAndPerks":3,
"ratingCareerOpportunity":3,
"ratingExecutiveManagement":1,
"ratingWorkEnvironment":2,
"ratingWorkLifeBalance":1,
"ratingStressLevel":null,
"ratingDiversity":3,
"reviewCreatedAt":"2019-09-11T11:41:10Z",
"reviewCreatedTimeAgoText":"1 month ago",
"reviewResponse":"Thank you for your feedback. At CommBank, we are continually working to ensure our performance metrics are realistic and achievable, so we appreciate your insights, which we will pass on to the Human Resources & Remuneration team. If you have any other feedback that you would like to share, we also encourage you to speak to HR Direct on 1800 989 696.",
"reviewResponseBy":"Employer Brand",
"reviewResponseForeignUserId":1,
"reviewResponseCreatedAt":"2019-10-17T05:13:52Z",
"reviewResponseCreatedTimeAgoText":"a few days ago",
"crowdflowerScore":3.0,
"isAnonymized":false,
"normalizedCfScore":2000.0,
"score":3.0483236,
"roleProximityScore":0.002
},
{
"timeAgoText":null,
"id":5327368,
"companyName":"Commonwealth Bank of Australia",
"companyRecommended":true,
"salarySummary":"below",
"salarySummaryDisplayText":"Low",
"jobTitle":"Customer Service Role",
"title":"Great to start your career in banking; not so great to stay for more than a few years",
"pros":"- Great work culture\n- Amazing colleagues\n- good career progress",
"cons":"- hard to get leave approved\n- no full-time opportunities\n- no staff benefits of real value",
"yearLeft":"still_work_here",
"yearLeftEmploymentStatusText":"current employee",
"yearsWorkedWith":"0_1_year",
"yearsWorkedWithText":"Less than 1 year",
"workLocation":"Melbourne VIC, Australia",
"ratingCompanyOverall":3,
"ratingBenefitsAndPerks":1,
"ratingCareerOpportunity":3,
"ratingExecutiveManagement":2,
"ratingWorkEnvironment":5,
"ratingWorkLifeBalance":3,
"ratingStressLevel":null,
"ratingDiversity":5,
"reviewCreatedAt":"2019-09-11T07:05:26Z",
"reviewCreatedTimeAgoText":"1 month ago",
"reviewResponse":"",
"reviewResponseBy":"",
"reviewResponseForeignUserId":null,
"reviewResponseCreatedAt":null,
"reviewResponseCreatedTimeAgoText":"",
"crowdflowerScore":3.0,
"isAnonymized":false,
"normalizedCfScore":2000.0,
"score":3.0483236,
"roleProximityScore":0.002
},
I have created a dictionary and then tried returning the data, but only one value gets returned:
name = 'seek-spider'
allowed_domains = ['seek.com.au']
start_urls = [
    'https://www.seek.com.au/companies/commonwealth-bank-of-australia-432306/reviews']

s = str(start_urls)
res = re.findall(r'\d+', s)
res = str(res)
string = (res[res.find("[")+1:res.find("]")])
string_replaced = string.replace("'", "")

start_urls = [
    'https://company-profiles-api.cloud.seek.com.au/v1/companies/'+string_replaced+'/reviews?page=1']

def parse(self, response):
    result = json.loads(response.body)
    detail = {}
    for i in result['data']:
        detail['ID'] = i['id']
        detail['Title'] = i['title']
        detail['Pros'] = i['pros']
        detail['Cons'] = i['cons']
    return detail
I expect the output to have all the data, but only this is returned:
{'ID': 135413, 'Title': 'Great place to work!', 'Pros': 'All of the above.', 'Cons': 'None that I can think of'}
The dictionary I was creating was erasing my previous data. I created a list before looping and the problem was solved.
def parse(self, response):
    result = json.loads(response.body)
    res = []
    for i in result['data']:
        detail = {}
        detail['id'] = i['id']
        res.append(detail)
    return res
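A note on the design choice: in Scrapy it is also idiomatic to yield one item per review rather than returning a list, which lets the feed exporters collect every item without building the list yourself. A sketch, keeping the fields from the question:

def parse(self, response):
    result = json.loads(response.body)
    # yield one dict per review; Scrapy collects every yielded item
    for i in result['data']:
        yield {
            'ID': i['id'],
            'Title': i['title'],
            'Pros': i['pros'],
            'Cons': i['cons'],
        }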