Scrapy follow urls based on condition - python

I am using Scrapy and I want to extract each topic that has at least 4 posts. I have two separate selectors:
real_url_list to get the href for each topic
nbpostsintopic_resp to get the number of posts
real_url_list = response.css("td.col-xs-8 a::attr(href)").getall()
for topic in real_url_list:
    nbpostsintopic_resp = response.css("td.center ::text").get()
    nbpostsintopic = nbpostsintopic_resp[0]
    if int(nbpostsintopic) > 4:
        yield response.follow(topic, callback=self.topic)
URL: https://www.allodocteurs.fr/forums-et-chats/forums/allergies/allergies-aux-pollens/
Unfortunately, the above does not work as expected: the number of posts does not seem to be taken into account. Is there a way to achieve such a condition?
Thank you in advance.

Your problem is with this line:
nbpostsintopic_resp = response.css("td.center ::text").get()
Note that this will always give you the same thing; there is no reference to your topic variable.
Instead, loop through the tr selectors and get both pieces of information from each row:
def parse(self, response):
    for row in response.css("tbody > tr"):
        nbpostsintopic_resp = row.css("td.center::text").get()
        if int(nbpostsintopic_resp) > 4:
            yield response.follow(row.css("td > a")[0], callback=self.topic)
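If the post-count cells on that page contain stray whitespace or non-numeric text, a slightly more defensive variant could look like the sketch below (this assumes the same markup and reuses the self.topic callback from the question):
def parse(self, response):
    for row in response.css("tbody > tr"):
        nbposts_text = row.css("td.center::text").get(default="").strip()
        link = row.css("td > a::attr(href)").get()
        # skip rows without a usable post count or link
        if link and nbposts_text.isdigit() and int(nbposts_text) > 4:
            yield response.follow(link, callback=self.topic)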

Related

Scrapy download data from links where certain other condition is fulfilled

I am extracting data from Imdb lists and it is working fine. I provide a link for all lists related to an Imdb title, and the code opens all the lists and extracts the data I want.
class lisTopSpider(scrapy.Spider):
    name = 'ImdbListsSpider'
    allowed_domains = ['imdb.com']
    start_urls = [
        'https://www.imdb.com/lists/tt2218988'
    ]

    # lists related to given title
    def parse(self, response):
        # Grab list link section
        listsLinks = response.xpath('//div[2]/strong')
        for link in listsLinks:
            list_url = response.urljoin(link.xpath('.//a/@href').get())
            yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})
Now the issue is that I want this code to skip all lists that have more than 50 titles and only get data from lists with fewer than 50 titles.
The problem is that the list link is in one xpath block and the number of titles is in another.
So I tried the following.
for link in listsLinks:
    list_url = response.urljoin(link.xpath('.//a/@href').get())
    numOfTitlesString = response.xpath('//div[@class="list_meta"]/text()[1]').get()
    numOfTitles = int(''.join(filter(lambda i: i.isdigit(), numOfTitlesString)))
    print('numOfTitles', numOfTitles)
    if numOfTitles < 51:
        yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})
But it gives me an empty csv file. When I try to print numOfTitles inside the for loop, it prints the result of the very first xpath match on every round of the loop.
Please suggest a solution for this.
As Gallaecio mentioned, it's just an xpath issue. It's normal that you keep getting the same number, because you're executing the exact same xpath on the exact same response object. In the code below we get the whole block (instead of just the part that contains the url), and for every block we get the url and the number of titles.
list_blocks = response.xpath('//*[has-class("list-preview")]')
for block in list_blocks:
    list_url = response.urljoin(block.xpath('./*[@class="list_name"]//@href').get())
    number_of_titles_string = block.xpath('./*[@class="list_meta"]/text()').get()
    number_of_titles = int(''.join(filter(lambda i: i.isdigit(), number_of_titles_string)))
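From there, the filter from the question can be applied inside the same loop; as a sketch, reusing the question's own parse_list callback and meta:
    if number_of_titles < 51:
        yield scrapy.Request(list_url, callback=self.parse_list,
                             meta={'list_url': list_url})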

How to iterate through Indeed reviews and find the corresponding job offer, printing the employee review?

Having already set up a dynamic search for offers based on companies, which generates a link used to look up the job reviews written by previous employees, I'm now faced with coding the part that, after assigning the job offers, job reviews and descriptions to lists, would let me iterate through them and print the corresponding pairs.
It all seems easy to do until you notice that the job offers list has a different size than the job reviews list, so I'm at a standstill with the following situation.
I'm trying the following code, which obviously gives me an error since cargo_revisto_list is longer than nome_emprego_list; this tends to happen once you have more reviews than job offers, as well as the opposite.
The lists would be, for example, the following:
cargo_revisto_list = ["Business Leader","Sales Manager"]
nome_emprego_list = ["Business Leader","Sales Manager","Front-end Developer"]
opiniao_list = ["Excellent Job","Wonderful managing"]
It would be a question of luck to get them to be exactly the same size.
url = "https://www.indeed.pt/cmp/Novabase/reviews?fcountry=PT&floc=Lisboa"
comprimento_cargo_revisto = len(cargo_revisto_list) #19
comprimento_nome_emprego = len(nome_emprego_list) #10
descricoes_para_cargos_existentes = []
if comprimento_cargo_revisto > comprimento_nome_emprego:
for i in range(len(cargo_revisto_list)):
s = cargo_revisto_list[i]
for z in range(len(nome_emprego_list)):
a = nome_emprego_list[z]
if(s == a): #A Stopping here needs new way of comparing strings
c=opiniao_list[i]
descricoes_para_cargos_existentes.append(c)
elif comprimento_nome_emprego > comprimento_cargo_revisto:
for i in range(len(comprimento_nome_emprego)):
s = nome_emprego_list[i]
for z in range(len(cargo_revisto_list)):
a = cargo_revisto_list[z]
if(s == a) and a!=None:
c = opiniao_list[z]
descricoes_para_cargos_existentes.append(c)
else:
for i in range(len(cargo_revisto_list)):
s = cargo_revisto_list[i]
for z in range(len(nome_emprego_list)):
a = nome_emprego_list[z]
if(s == a):
c = (opiniao_list[i])
descricoes_para_cargos_existentes.append(c)
After solving this issue I would need the exact review text for the job that corresponds to the job offer. To do that I would take the index of the match in cargo_revisto_list and use it to look up opiniao_list (the review text), since both lists were appended in the same order by Beautiful Soup at scraping time.
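Since the matching step is only described in prose above, here is a minimal sketch of one way to express it, using nothing but the example lists from the question (the dict lookup is an illustrative choice, not code from the post):
cargo_revisto_list = ["Business Leader", "Sales Manager"]
nome_emprego_list = ["Business Leader", "Sales Manager", "Front-end Developer"]
opiniao_list = ["Excellent Job", "Wonderful managing"]

# Reviewed titles and review texts were appended in the same order at scraping
# time, so zipping them pairs each reviewed title with its review.
review_by_title = dict(zip(cargo_revisto_list, opiniao_list))

# Keep only the offers that actually have a review, in offer order.
descricoes_para_cargos_existentes = [
    review_by_title[nome] for nome in nome_emprego_list if nome in review_by_title
]
print(descricoes_para_cargos_existentes)  # ['Excellent Job', 'Wonderful managing']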

Python script extract from HTML

I'm writing a script that scans through a set of links. Within each link the script searches a table for a row. Once found, it increments the variable total_rank, which is the sum of the ranks found on each web page. The rank is equal to the row number.
The code looks like this and is outputting zero:
import requests
from bs4 import BeautifulSoup
import time

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")

stat_links = []
for a in soup.select(".chooser-list ul"):
    list_entry = a.findAll('li')
    relative_link = list_entry[0].find('a')['href']
    link = "https://www.teamrankings.com" + relative_link
    stat_links.append(link)

total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer table")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank
    # time.sleep(1)
print total_rank
Debugging shows that team_rows is empty after the select() call. The thing is, I've also tried different tags: for example soup.select(".scroll-wrapper div") and soup.select("#DataTables_Table_0_wrapper div"), but they all return nothing.
The selector
".tr-table datatable scrollable dataTable no-footer tr"
selects a <tr> element anywhere under a <no-footer> element anywhere under a <dataTable> element, and so on.
I think "datatable scrollable dataTable no-footer" are really classes on your .tr-table element. In that case, they should be joined to the first class with periods, so I believe the final correct selector is:
".tr-table.datatable.scrollable.dataTable.no-footer tr"
UPDATE: the new selector looks like this:
".tr-table.datatable.scrollable.dataTable.no-footer table"
The problem here is that the first part, .tr-table.datatable... refers to the table itself. Assuming you're trying to get the rows of this table:
<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">
The proper selector remains the one I originally suggested.
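To make the class-versus-descendant distinction concrete, here is a minimal sketch against hypothetical markup (not the live page):
from bs4 import BeautifulSoup

html = '<table class="tr-table datatable scrollable dataTable no-footer"><tr><td>1</td><td>Oklahoma</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# Space-separated: looks for <datatable>, <scrollable>, ... descendant elements -> no match
print(soup.select(".tr-table datatable scrollable dataTable no-footer tr"))   # []

# Dot-joined: all five classes on the same <table> element -> matches the row
print(soup.select(".tr-table.datatable.scrollable.dataTable.no-footer tr"))   # [<tr>...</tr>]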
@audiodude's answer is correct, though the suggested selector is not working for me.
You don't need to check every single class of the table element. Here is a working selector:
team_rows = soup.select("table.datatable tr")
Also, if you need to find Oklahoma inside the table, you don't have to iterate over every row and cell. Just search directly for the specific cell and get the previous sibling containing the rank:
rank = soup.find("td", {"data-sort": "Oklahoma"}).find_previous_sibling("td").get_text()
total_rank += int(rank)  # it is important to convert the row number to int
Also note that you are extracting more stats links than you should; it looks like the Player Stats links should not be followed, since you are focused specifically on the Team Stats. Here is one way to get the Team Stats links only:
links_list = soup.find("h2", text="Team Stats").find_next_sibling("ul")
stat_links = ["https://www.teamrankings.com" + a["href"]
              for a in links_list.select("ul.expand-content li a[href]")]

How to get data from all pages in Github API with Python?

I'm trying to export a repo list and it always returns information about the first page. I could extend the number of items per page using URL+"?per_page=100", but that's not enough to get the whole list.
I need to know how I can get the list by extracting data from page 1, 2, ..., N.
I'm using the Requests module, like this:
while i <= 2:
    r = requests.get('https://api.github.com/orgs/xxxxxxx/repos?page{0}&per_page=100'.format(i),
                     auth=('My_user', 'My_passwd'))
    repo = r.json()
    j = 0
    while j < len(repo):
        print repo[j][u'full_name']
        j = j + 1
    i = i + 1
I use that while condition because I know there are 2 pages, and I tried to increase the page number that way, but it doesn't work.
import requests

url = "https://api.github.com/XXXX?simple=yes&per_page=100&page=1"
res = requests.get(url, headers={"Authorization": git_token})
repos = res.json()
while 'next' in res.links.keys():
    res = requests.get(res.links['next']['url'], headers={"Authorization": git_token})
    repos.extend(res.json())
If you aren't making a full-blown app, use a "Personal Access Token":
https://github.com/settings/tokens
From github docs:
Response:
Status: 200 OK
Link: <https://api.github.com/resource?page=2>; rel="next",
<https://api.github.com/resource?page=5>; rel="last"
X-RateLimit-Limit: 5000
X-RateLimit-Remaining: 4999
You get the links to the next and the last page of that organization. Just check the headers.
On Python Requests, you can access your headers with:
response.headers
It is a dictionary containing the response headers. If link is present, then there are more pages and it will contain related information. It is recommended to traverse using those links instead of building your own.
You can try something like this:
import requests

url = 'https://api.github.com/orgs/xxxxxxx/repos?page{0}&per_page=100'
response = requests.get(url)
link = response.headers.get('link', None)
if link is not None:
    print link
If link is not None, it will be a string containing the relevant links for your resource.
From my understanding, link will be None if only a single page of data is returned; otherwise link will be present even when going beyond the last page, in which case it will contain the previous and first links.
Here is some sample Python which aims to simply return the link for the next page, and returns None if there is no next page, so it could be incorporated in a loop.
link = r.headers.get('link')
if link is None:
    return None

# Should be a comma separated string of links
links = link.split(',')

for link in links:
    # If there is a 'next' link return the URL between the angle brackets, or None
    if 'rel="next"' in link:
        return link[link.find("<") + 1:link.find(">")]
return None
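As a rough illustration of incorporating that helper into a loop, here is a sketch that wraps it in a function; the function name and the org URL are illustrative, not taken from the original post:
import requests

def next_page_url(r):
    link = r.headers.get('link')
    if link is None:
        return None
    for part in link.split(','):
        # return the URL between the angle brackets of the 'next' link, if any
        if 'rel="next"' in part:
            return part[part.find("<") + 1:part.find(">")]
    return None

url = 'https://api.github.com/orgs/xxxxxxx/repos?per_page=100'
repos = []
while url is not None:
    r = requests.get(url, auth=('My_user', 'My_passwd'))
    repos.extend(r.json())
    url = next_page_url(r)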
Extending on the answers above, here is a recursive function to deal with GitHub pagination. It iterates through all the pages, concatenating the list with each recursive call, and finally returns the complete list when there are no more pages to retrieve, unless the optional failsafe kicks in and returns the list once it holds more than 500 items.
import requests

api_get_users = 'https://api.github.com/users'

def call_api(apicall, **kwargs):
    data = kwargs.get('page', [])
    resp = requests.get(apicall)
    data += resp.json()
    # failsafe
    if len(data) > 500:
        return data
    if 'next' in resp.links.keys():
        return call_api(resp.links['next']['url'], page=data)
    return data

data = call_api(api_get_users)
First, use
print(a.headers.get('link'))
This will show you how many pages the repository has, similar to below:
<https://api.github.com/organizations/xxxx/repos?page=2&type=all>; rel="next",
<https://api.github.com/organizations/xxxx/repos?page=8&type=all>; rel="last"
From this you can see that we are currently on the first page of repos, rel="next" says that the next page is 2, and rel="last" tells us that the last page is 8.
Once you know how many pages there are to traverse, you just need to use '=' for the page number in the request URL and change the while loop to run until the last page number, not len(repo), as that will just return 100 each time.
For example:
i = 1
while i <= 8:
    r = requests.get('https://api.github.com/orgs/xxxx/repos?page={0}&type=all'.format(i),
                     auth=('My_user', 'My_passwd'))
    repo = r.json()
    for j in repo:
        print(j['full_name'])  # each j is a repo dict
    i = i + 1
link = res.headers.get('link', None)
if link is not None:
    link_next = [l for l in link.split(',') if 'rel="next"' in l]
    if len(link_next) > 0:
        return int(link_next[0][link_next[0].find("page=") + 5:link_next[0].find(">")])

Scrapy - incrementing number in a string

Again I seem to have hit a brick wall with this one, and I'm hoping somebody will be able to answer it off the top of their head.
Here's an example of the code below:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item["Details_H1"] = hxs.select('//*[@id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()').extract()
    return item
It seems that the @id in the Details_H1 xpath can change. E.g. for one page it could be @id="ctl08_p_ctl17_ctl04_ctl01_ctl00_dlProps" and for the next page it is, seemingly at random, @id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps".
I would like to implement the equivalent of a do-until loop, so that the code cycles through the numbers in increments of 1 until the value returned by the XPath is non-empty. So, for example, I could set i = 108 and do i = i + 1 each time until hxs.select('//*[@id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()').extract() != [].
How would I be able to implement this?
Your help and contribution is greatly appreciated.
EDIT 1
Fix addressed by TNT below. The code should read:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item["Details_H1"] = hxs.select('//*[contains(@id, "_p_ctl17_ctl04_ctl01_ctl00_dlProps")]/tr[1]/td[1]/text()').extract()
    return item
The 'natural' XPath way would be to generalize your xpath expression:
xp = '//*[contains(@id, "_p_ctl17_ctl04_ctl01_ctl00_dlProps")]/tr[1]/td[1]/text()'
item["Details_H1"] = hxs.select(xp).extract()
But I'm groping in the dark; your xpath expression would probably be better off beginning with something like //table or //tbody.
In any case, a "do until" loop would be ugly.
You can try this:
i = 108
while True:
    item = response.meta['item']
    xpath = '//*[@id="ct%d_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()' % i
    item["Details_H1"] = hxs.select(xpath).extract()
    if not item["Details_H1"]:
        break
    i += 1
    yield item
