Python: Scrapy Gathering All Text of a Selector's Children

I'm trying to scrape the descriptions of ebay listings, and was approaching it with this:
def parse_description(self, response):
    description = response.css('div#ds_div*::text').get()
    yield {
        "description": description
    }
The idea was to grab the text of all the tags that are under .css('div#ds_div')
However, I'm getting this error:
"Expected selector, got %s" % (peek,))
File "<string>", line None
cssselect.parser.SelectorSyntaxError: Expected selector, got <DELIM '*' at 10>
Example URL I am trying to scrape: https://www.ebay.co.uk/itm/Vintage-Toastmaster-Chrome-Toaster-Model-D182-4-Slice-Wide-Slot-Nos/114677725765?hash=item1ab3533a45:g:ui8AAOSw-jpgBbFS
Where am I going wrong?

The error refers to the selector not being valid:
div#ds_div*::text
If you put a space between div#ds_div and *, it is valid, as you've also mentioned in the comments.
Looking at the link, another problem is that the text you're trying to retrieve is inside an iframe with id desc_ifr.
If you want to scrape the content inside this iframe, look at the src attribute of the iframe and scrape that URL instead of the one in your question. Then you can do this:
response.css('div#ds_div p::text').get()
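Putting both fixes together, here is a minimal sketch of how the spider might look (hedged: the spider name, the callback split, and the use of getall() to gather every text node rather than just the first are assumptions beyond the original snippet):
import scrapy


class EbayDescriptionSpider(scrapy.Spider):
    name = "ebay_description"
    start_urls = [
        'https://www.ebay.co.uk/itm/Vintage-Toastmaster-Chrome-Toaster-Model-D182-4-Slice-Wide-Slot-Nos/114677725765?hash=item1ab3533a45:g:ui8AAOSw-jpgBbFS'
    ]

    def parse(self, response):
        # The description is rendered inside an iframe, so follow the
        # iframe's src URL and extract the text from that page instead.
        iframe_url = response.css('iframe#desc_ifr::attr(src)').get()
        if iframe_url:
            yield response.follow(iframe_url, callback=self.parse_description)

    def parse_description(self, response):
        # Note the space before *: 'div#ds_div *::text' selects the text
        # nodes of all descendants; getall() returns every match.
        texts = response.css('div#ds_div *::text').getall()
        yield {"description": " ".join(t.strip() for t in texts if t.strip())}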

Related

Scraping an onclick value in BeautifulSoup in Pandas

For class, we've been asked to scrape the North Korean News Agency's website: http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf
The question asks to scrape the onclick values from the website. I've tried solving this in two different ways: by navigating the DOM tree, and by building a regex within a loop to systematically pull them out. I've failed on both counts.
Attempt 1:
onclick_soup = soup_doc.find_all('a', class_='titlebet')[0]
onclick_soup
Output:
<a class="titlebet" href="#this" onclick='fn_showArticle("AR0140322",
"", "NT00", "L")'>경애하는 최고령도자 <nobr><strong><font
style="font-size:10pt;">김정은</font></strong></nobr>동지께서 라오스인민혁명당 중앙위원회
총비서인 라오스인민민주주의공화국 주석에게 축전을 보내시였다</a>
Attempt 2:
regex_for_onclick_soup = r"onclick='(.*?)\("
onclick_value_soup = soup_doc.find_all('a', class_='titlebet')
for onclick_value in onclick_value_soup:
    value = re.findall(regex_for_onclick_value, onclick_value)
    print(onclick_value)
Attempt 2 results in a TypeError.
I'm doing this in pandas. Any guidance would be helpful.
You can simply iterate over every tag in the HTML and check for an onclick attribute.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf')
soup = BeautifulSoup(page.content, 'lxml')

for tag in soup.find_all():
    on_click = tag.get('onclick')
    if on_click:
        print(on_click)
Note that when using find_all() without any arguments it retrieves every tag. We then check each of these tags for an onclick attribute that is not None and print it out.
Outputs:
fn_convertLanguage('kor')
fn_convertLanguage('eng')
fn_convertLanguage('chn')
fn_convertLanguage('rus')
fn_convertLanguage('spn')
fn_convertLanguage('jpn')
GotoLogin()
register()
evalSearch()
...
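As an aside, the TypeError in Attempt 2 comes from passing a BeautifulSoup Tag object to re.findall(), which expects a string (the snippet also defines regex_for_onclick_soup but references regex_for_onclick_value). A minimal corrected sketch of that regex approach, reusing the asker's soup_doc:
import re

# tag.get('onclick') is already the attribute value, e.g. 'fn_showArticle("AR0140322", ...)',
# so the pattern only needs to capture the function name before the "(".
pattern = re.compile(r"^(\w+)\(")

for tag in soup_doc.find_all('a', class_='titlebet'):
    onclick = tag.get('onclick')
    if onclick:
        match = pattern.match(onclick)
        if match:
            print(match.group(1))  # e.g. fn_showArticle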

scrapy-based crawler not extracting content within <p> tags

I have a custom crawler that scrapes news articles. For the most part it works; however, when adding new URLs, it's sometimes hard to figure out which CSS selectors to use to get the content I want. Below is the code I'm working on.
# -*- coding: utf-8 -*-
"""Script to crawl articles from https://mycbs4.com"""
try:
    from crawler import BaseCrawler
except ImportError:
    from __init__ import BaseCrawler


class Cmycbs4Crawler(BaseCrawler):
    start_urls = [
        'https://mycbs4.com/search?find=cannabis',
        'https://mycbs4.com/search?find=marijuana',
        'https://mycbs4.com/search?find=cbd',
        'https://mycbs4.com/search?find=thc',
        'https://mycbs4.com/search?find=hemp'
    ]
    source_id = 'mycbs4'
    config_selectors = {
        # CSS selector on the search results page (the page listing many articles)
        'POST_URLS': '.sd-main a::attr(href)',
        #'NEXT_PAGE_URL': '.pager-next > a::attr(href)',  # default
        # CSS selector on the article's detail page (the page displaying the full article)
        'ARTICLE_CONTENT': '#js-Story-Content-0 > p',
    }


if __name__ == "__main__":
    crawler = Cmycbs4Crawler()
    crawler.run()
The crawler should crawl the URLs and populate everything back into a DB. It scrapes everything except the content.
I've tried the following selectors:
'#js-Story-Content-0 > p'
'.StoryText_storyText__1uZ3 > p'
'#js-Story-Content-0 .StoryText_storyText__1uZ3 > p'
None of them leads to scraped content from the article, so I'm not sure what I'm doing wrong.
Below is a screenshot of the content/p tags I'm trying to scrape.
Any help would be greatly appreciated.
Your content lives in <script data-prerender="facade" type="application/json">, which is great because you don't have to go spelunking around in the HTML to parse the information you want; you can use json.loads instead.
By the way, it's a dead giveaway when you see a selector like js-Story-Content-0 and you cannot find any of those <blockquote> elements in the page source: the page source is not equal to the page DOM, and Scrapy always sees only the page source, never the DOM.
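A minimal sketch of that approach (hedged: the callback name is made up, and the structure of the prerendered JSON is an assumption; dump it first to see where the article body actually lives):
import json

def parse_article(self, response):
    # Grab the prerendered JSON blob embedded in the page source.
    raw = response.css('script[data-prerender="facade"]::text').get()
    if raw:
        data = json.loads(raw)
        # The layout of this dict is hypothetical; print it (e.g. with
        # json.dumps(data, indent=2)) to discover the real structure.
        yield {'article': data}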

Find the right CSS selector to crawl a webpage in Scrapy

I'm trying to crawl the webpage "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas" to extract the product names, but I can't find the right selector, even for the price, the h1, or the title! I tried:
response.css(".shelfProductTile-descriptionLink")  # for the product name
response.css(".price-cents")  # for the price
response.css(".tileList-title")  # for the title
How can I proceed?
Content is dynamically loaded from a POST XHR returning JSON, which you can find in the network tab of your browser.
The request goes to:
https://www.woolworths.com.au/apis/ui/browse/category
Payload:
{"categoryId":"1_9573995","pageNumber":1,"pageSize":24,"sortType":"TraderRelevance","url":"/shop/browse/drinks/cordials-juices-iced-teas/iced-teas","location":"/shop/browse/drinks/cordials-juices-iced-teas/iced-teas","formatObject":"{\"name\":\"Iced Teas\"}","isSpecial":False,"isBundle":False,"isMobile":False,"filters":"null"}
With the response in Scrapy, use:
json.loads(response.body_as_unicode())
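For instance, a sketch of issuing that POST from a Scrapy spider (hedged: the headers are an assumption, and on recent Scrapy versions response.json() or response.text replaces the deprecated body_as_unicode()):
import json
import scrapy


class IcedTeaSpider(scrapy.Spider):
    name = 'iced_teas'

    def start_requests(self):
        # Payload copied from the browser's network tab, as shown above.
        payload = {
            "categoryId": "1_9573995",
            "pageNumber": 1,
            "pageSize": 24,
            "sortType": "TraderRelevance",
            "url": "/shop/browse/drinks/cordials-juices-iced-teas/iced-teas",
            "location": "/shop/browse/drinks/cordials-juices-iced-teas/iced-teas",
            "formatObject": '{"name":"Iced Teas"}',
            "isSpecial": False,
            "isBundle": False,
            "isMobile": False,
            "filters": "null",
        }
        yield scrapy.Request(
            'https://www.woolworths.com.au/apis/ui/browse/category',
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse,
        )

    def parse(self, response):
        data = json.loads(response.text)
        # 'Bundles' is a guess at the top-level key; inspect the JSON to confirm.
        yield {'products': data.get('Bundles')}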

Python/Scrapy scraping from Techcrunch

I am trying to build a spider to scrape some data from the Techcrunch website - Heartbleed search.
My thought was to pass a tag when executing the spider from the command line (example: Heartbleed). The spider should then search through all the associated search results, open each link, and get the data contained within.
import scrapy


class TechcrunchSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        pass
This code can be executed with: scrapy crawl tech_search -s DOWNLOAD_DELAY=1.5 -o tech_search.jl -a tag=EXAMPLEINPUT
Getting the data from the individual pages is not the problem; actually getting the URLs to them from the search page linked above is.
The thing is, when looking at the HTML source (Ctrl+U) of the search page, I can't find anything about the searched elements (example: "What Is Heartbleed? The Video"). Any suggestions on how to obtain these elements?
I suggest that you define your Scrapy class along the lines shown in this answer, but using the PhantomJS selenium headless browser. The essential problem is that those pages use JavaScript to build the HTML (DOM) you see in the browser, so the raw page source Scrapy downloads does not contain the content you are trying to access via the route you have chosen.
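A rough sketch of that idea (hedged: PhantomJS has since been deprecated, so this uses headless Firefox instead, and the result-link selector is a placeholder you would need to verify against the rendered page):
import scrapy
from selenium import webdriver
from selenium.webdriver.firefox.options import Options


class TechcrunchSeleniumSpider(scrapy.Spider):
    name = "tech_search"

    def start_requests(self):
        url = 'https://techcrunch.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + '?s=' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        options = Options()
        options.headless = True
        driver = webdriver.Firefox(options=options)
        try:
            # Let the browser run the JavaScript that builds the result list,
            # then re-parse the rendered DOM with Scrapy's own selectors.
            driver.get(response.url)
            rendered = scrapy.Selector(text=driver.page_source)
            # 'a.post-block__title__link' is a hypothetical selector; inspect
            # the rendered page to find the real one.
            for href in rendered.css('a.post-block__title__link::attr(href)').getall():
                yield response.follow(href, self.parse_article)
        finally:
            driver.quit()

    def parse_article(self, response):
        pass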

how to extract data from autocomplete box with selenium python

I am trying to extract data from a search box; you can see a good example on Wikipedia.
This is my code:
driver = webdriver.Firefox()
driver.get(response.url)

city = driver.find_element_by_id('searchInput')
city.click()
city.clear()
city.send_keys('a')
time.sleep(1.5)  # waiting for ajax to load

selen_html = driver.page_source
#print selen_html.encode('utf-8')
hxs = HtmlXPathSelector(text=selen_html)
ajaxWikiList = hxs.select('//div[@class="suggestions"]')

items = []
for city in ajaxWikiList:
    item = TestItem()
    item['ajax'] = city.select('/div[@class="suggestions-results"]/a/@title').extract()
    items.append(item)
print items
The XPath expression is OK; I checked it on a static page. If I uncomment the line that prints out the scraped HTML, the code for the box shows at the end of the file. But for some reason I can't extract the data from it with the above code. I must be missing something, since I tried two different sources; the Wikipedia page is just another source where I can't get this data extracted.
Any advice here? Thanks!
Instead of passing the .page_source, which in your case contains an empty suggestions div, get the innerHTML of the element and pass it to the selector:
selen_html = driver.find_element_by_class_name('suggestions').get_attribute('innerHTML')
hxs = HtmlXPathSelector(text=selen_html)

suggestions = hxs.select('//div[@class="suggestions-results"]/a/@title').extract()
for suggestion in suggestions:
    print suggestion
Outputs:
Animal
Association football
Arthropod
Australia
AllMusic
African American (U.S. Census)
Album
Angiosperms
Actor
American football
Note that it would be better to use selenium's explicit wait feature to wait for the element to become accessible/visible; see:
How can I get Selenium Web Driver to wait for an element to be accessible, not just present?
Selenium waitForElement
Also, note that HtmlXPathSelector is deprecated; use Selector instead.
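For example, an explicit wait for the suggestions box might look like this (a sketch assuming selenium's expected_conditions helpers and the same class name as above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the suggestions dropdown to become visible,
# instead of sleeping a fixed 1.5 seconds and hoping the AJAX has finished.
wait = WebDriverWait(driver, 10)
suggestions_div = wait.until(
    EC.visibility_of_element_located((By.CLASS_NAME, 'suggestions'))
)
selen_html = suggestions_div.get_attribute('innerHTML')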
