Scrapy - incrementing number in a string - python

Again I seem to have hit a brick wall with this one and I'm hoping somebody will be able to answer it off the top of their head.
Here's an example code below:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item["Details_H1"] = hxs.select('//*[@id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()').extract()
    return item
It seems that the @id in the Details_H1 XPath can change. E.g. for one page it could be @id="ctl08_p_ctl17_ctl04_ctl01_ctl00_dlProps" and for the next page it's randomly @id="ctl09_p_ctl17_ctl04_ctl01_ctl00_dlProps".
I would like to implement the equivalent of a do-until loop, so that the code cycles through the numbers in increments of 1 until the value returned by the XPath is non-empty. So, for example, I could start with i = 8 and do i = i + 1 each time until hxs.select('//*[@id="ctl%02d_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()' % i).extract() != [].
How would I be able to implement this?
Your help and contributions are greatly appreciated.
EDIT 1
Fix addressed by TNT below. Code should read:
def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    item["Details_H1"] = hxs.select('//*[contains(@id, "_p_ctl17_ctl04_ctl01_ctl00_dlProps")]/tr[1]/td[1]/text()').extract()
    return item

The 'natural' XPath way would be to generalize your XPath expression further:
xp = '//*[contains(@id, "_p_ctl17_ctl04_ctl01_ctl00_dlProps")]/tr[1]/td[1]/text()'
item["Details_H1"] = hxs.select(xp).extract()
But I'm groping in the dark here. Your XPath expression would probably be better off beginning with something like //table or //tbody.
In any case a "do until" would be ugly.
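If you do want to anchor it more tightly, the two suggestions combine naturally; a minimal sketch, where the //table anchor is an assumption about the page's markup and the id fragment comes from the question:
xp = '//table[contains(@id, "_p_ctl17_ctl04_ctl01_ctl00_dlProps")]/tr[1]/td[1]/text()'
item["Details_H1"] = hxs.select(xp).extract()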

You can try this
i = 8
while True:
    item = response.meta['item']
    # the numeric part of the id is zero-padded: ctl08, ctl09, ...
    xpath = '//*[@id="ctl%02d_p_ctl17_ctl04_ctl01_ctl00_dlProps"]/tr[1]/td[1]/text()' % i
    item["Details_H1"] = hxs.select(xpath).extract()
    if item["Details_H1"]:
        # the XPath finally matched something -- stop incrementing
        yield item
        break
    i += 1

Related

Selenium Can't find elements by CLASS_NAME

I'm trying to scrape a website and get every "meal_box meal_container row" element into a list with driver.find_elements, but for some reason I couldn't do it. I tried By.CLASS_NAME, because it seemed the logical choice, but the length of my list was 0. Then I tried By.XPATH, and the length was then 1 (I understand why). I think I could use XPath to get them one by one, but I don't want to do that if I can handle it in a for loop.
I don't know why find_elements(By.CLASS_NAME, 'print_name') works but find_elements(By.CLASS_NAME, "meal_box meal_container row") does not.
I'm new at both web scraping and Stack Overflow, so if any other details are needed I can add them.
Here is my code:
meals = driver.find_elements(By.CLASS_NAME, "meal_box meal_container row")
print(len(meals))
for index, meal in enumerate(meals):
    foods = meal.find_elements(By.CLASS_NAME, 'print_name')
    print(len(foods))
    if index == 0:
        mealName = "Breakfast"
    elif index == 1:
        mealName = "Lunch"
    elif index == 2:
        mealName = "Dinner"
    else:
        mealName = "Snack"
    for index, title in enumerate(foods):
        recipe = {}
        print(title.text)
        print(mealName + "\n")
        recipe["name"] = title.text
        recipe["meal"] = mealName
Here is the screenshot of the HTML:
It seems OK, but for the class name you should put a dot between the class names, like "meal_box.meal_container.row". Try this:
meals = driver.find_elements(By.CLASS_NAME, "meal_box.meal_container.row")
Try using driver.find_element_by_css_selector instead.
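For instance, a sketch of the By.CSS_SELECTOR equivalent (the div tag name is an assumption about the markup; the class names come from the question):
from selenium.webdriver.common.by import By

# a CSS selector matches compound classes natively: .a.b.c means "carries all three classes"
meals = driver.find_elements(By.CSS_SELECTOR, "div.meal_box.meal_container.row")
print(len(meals))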
It can be because the "meal_box meal_container row" element is nested inside another element, so try finding the outer element first and then look for the one you need inside it:
root = driver.find_element(By.CLASS_NAME, "row")
# By.CLASS_NAME expects a single class name
meals = root.find_elements(By.CLASS_NAME, "meal_box")

Scrapy follow urls based on condition

I am using Scrapy and I want to extract each topic that has at least 4 posts. I have two separate selectors:
real_url_list to get the href of each topic
nbpostsintopic_resp to get the number of posts
real_url_list = response.css("td.col-xs-8 a::attr(href)").getall()
for topic in real_url_list:
    nbpostsintopic_resp = response.css("td.center ::text").get()
    nbpostsintopic = nbpostsintopic_resp[0]
    if int(nbpostsintopic) > 4:
        yield response.follow(topic, callback=self.topic)
URL: https://www.allodocteurs.fr/forums-et-chats/forums/allergies/allergies-aux-pollens/
Unfortunately, the above does not work as expected; the number of posts does not seem to be taken into account. Is there a way to achieve such a condition?
Thank you in advance.
Your problem is with this line
nbpostsintopic_resp = response.css("td.center ::text").get()
Note that this will always give you the same thing; there is no reference to your topic variable.
Instead, loop through the tr selectors and get the information from each of them:
def parse(self, response):
    for row in response.css("tbody > tr"):
        nbpostsintopic_resp = row.css("td.center::text").get()
        if int(nbpostsintopic_resp) > 4:
            # requests have to be yielded back to Scrapy
            yield response.follow(row.css("td > a")[0], callback=self.topic)
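Alternatively, if you prefer to keep the two flat selectors from the question, you can zip them together; a hedged sketch that assumes the href column and the post-count column stay aligned row for row:
def parse(self, response):
    hrefs = response.css("td.col-xs-8 a::attr(href)").getall()
    counts = response.css("td.center ::text").getall()
    for href, count in zip(hrefs, counts):
        # only follow topics with more than 4 posts
        if int(count) > 4:
            yield response.follow(href, callback=self.topic)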

Scrapy download data from links where certain other condition is fulfilled

I am extracting data from IMDb lists and it is working fine. I provide a link to all lists related to an IMDb title; the code opens every list and extracts the data I want just fine.
class lisTopSpider(scrapy.Spider):
    name = 'ImdbListsSpider'
    allowed_domains = ['imdb.com']
    start_urls = [
        'https://www.imdb.com/lists/tt2218988'
    ]

    # lists related to given title
    def parse(self, response):
        # Grab list link section
        listsLinks = response.xpath('//div[2]/strong')
        for link in listsLinks:
            list_url = response.urljoin(link.xpath('.//a/@href').get())
            yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})
Now the issue is that I want this code to skip all lists that have more than 50 titles and only get data from lists with fewer than 50 titles.
The problem is that the list link sits in one xpath block and the number of titles in another.
So I tried the following.
for link in listsLinks:
    list_url = response.urljoin(link.xpath('.//a/@href').get())
    numOfTitlesString = response.xpath('//div[@class="list_meta"]/text()[1]').get()
    numOfTitles = int(''.join(filter(lambda i: i.isdigit(), numOfTitlesString)))
    print('numOfTitles', numOfTitles)
    if numOfTitles < 51:
        yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})
But it gives me an empty csv file. When I try to print numOfTitles inside the for loop, it gives me the result of the very first xpath match on every round of the loop.
Please suggest a solution for this.
As Gallaecio mentioned, it's just an xpath issue. It's normal you keep getting the same number, because you're executing the exact same xpath to the exact same response object. In the below code we get the whole block (instead of just the part that contains the url), and for every block we get the url and the number of titles.
list_blocks = response.xpath('//*[has-class("list-preview")]')
for block in list_blocks:
    list_url = response.urljoin(block.xpath('./*[@class="list_name"]//@href').get())
    number_of_titles_string = block.xpath('./*[@class="list_meta"]/text()').get()
    number_of_titles = int(''.join(filter(lambda i: i.isdigit(), number_of_titles_string)))
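The conditional request from the question then hangs off that count inside the same for block; roughly, reusing the question's parse_list callback:
    if number_of_titles < 51:
        # follow only lists with at most 50 titles
        yield scrapy.Request(list_url, callback=self.parse_list, meta={'list_url': list_url})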

String concatenation with python and xpath

Instead of writing 10+ IF statements, I'm trying to create one IF statement using a variable. Unfortunately, I'm not familiar with how to implement string concatenation for xpath using python. Can anyone teach me how to perform string formatting for the following code segments?
I would greatly appreciate it, thanks.
if page_number == 1:
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 2"]')
    next_link.click()
    page_number = page_number + 1
    time.sleep(30)
elif page_number == 2:
    next_link = browser.find_element_by_xpath('//*[@title="Go to page 3"]')
    next_link.click()
    page_number = page_number + 1
    time.sleep(30)
This answer is not about string concatenation, but about a simpler solution to the problem...
Instead of clicking a particular link in the pagination you can click the "Next" button:
pages_number = 10
for _ in range(pages_number):
    driver.find_element_by_xpath("//a[@title='Go to next page']").click()
    time.sleep(30)
If you need to open a specific page you can use the following:
required_page = 3
# link text has to be passed as a string
driver.find_element_by_link_text(str(required_page)).click()
P.S. I assumed you are talking about this site
You can use a for-loop
Ex:
for i in range(1, 10):
    next_link = browser.find_element_by_xpath('//*[@title="Go to page {0}"]'.format(i + 1))  # using str.format for the page number
    next_link.click()
    time.sleep(30)
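On Python 3.6+ an f-string does the same job and reads a bit more directly; a minimal sketch of the equivalent loop:
for i in range(1, 10):
    # the f-string builds the "Go to page N" title for the next page
    next_link = browser.find_element_by_xpath(f'//*[@title="Go to page {i + 1}"]')
    next_link.click()
    time.sleep(30)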

My loop isn't running

I'm writing code in Python to get all the 'a' tags from a URL using Beautiful Soup, then I take the link at position 3 and follow it, repeating this process about 18 times. I included the code below, which repeats the process 4 times. When I run this code I get the same 4 links in the results, but I should get 4 different links. I think there is something wrong with my loop, specifically around the line that says url=y. I need help figuring out what the problem is.
import re
import urllib
from BeautifulSoup import *

list1 = list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range(4):  # repeat 4 times
    htm2 = urllib.urlopen(url).read()
    soup1 = BeautifulSoup(htm2)
    tags1 = soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y = list1[2]
    if len(x2) < 3:  # no 3rd link
        break  # exit the loop
    else:
        url = y
    print y
You're continuing to add the third link EVER FOUND to your result list. Instead you should be adding the third link OF THAT ITERATION (which is soup('a')[2]), then reassigning your url and going again.
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
result = []
for i in range(4):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    for link in links:
        result.append(link)
    try:
        third_link = links[2]['href']
    except IndexError:  # less than three links
        break
    else:
        url = third_link
    print(url)
This is actually pretty simple in a recursive function:
def get_links(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    if len(links) < 3:
        # base case
        return links
    else:
        # recurse on third link
        return links + get_links(links[2]['href'])
You can even modify that to make sure you don't recurse too deep
def get_links(url, times=None):
    '''Returns all <a> tags from `url` and every 3rd link, up to `times` deep

    get_links("protocol://hostname.tld", times=2) -> list

    if times is None, recurse until there are fewer than 3 links to be found
    '''
    def _get_links(url, TTL):
        soup = BeautifulSoup(urllib.urlopen(url).read())
        links = soup('a')
        if (times is not None and TTL >= times) or \
                len(links) < 3:
            # base case
            return links
        else:
            return links + _get_links(links[2]['href'], TTL + 1)
    return _get_links(url, 0)
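A quick usage sketch, reusing the URL from the question and capping the recursion at 4 hops:
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
tags = get_links(url, times=4)  # follow the 3rd link at most 4 times
print(len(tags))  # total number of <a> tags collected along the way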
Your current code
y = list1[2]
just keeps selecting the URL located at index 2 of list1. Since that list only gets appended to, list1[2] never changes. You should instead select a different index each time you print if you want different URLs. I'm not sure what specifically you're trying to print, but y = list1[-1], for instance, would end up printing the last URL added to the list on that iteration (different each time).
