My Scrapy crawler is working fine; it is currently crawling some tables, but on some websites not all of the information I'd like to insert into my MySQL table is available.
So I thought about adding those values myself, because on those websites the information for those fields is always the same, but I am not sure how to populate them in the spider.
Sure, I could determine the length of one of the lists in the pipeline and then use a while loop to add, for example, 'USA' to the item['country'] list, but I want to do the same in the spider.
I would appreciate some help, thank you.
Current spider code for populating lists:
def parse(self, response):
    for sel in response.xpath('//div[@class="pagecontainer"]'):
        item = EbayItem()
        item['id'] = sel.xpath('div[2]/text()[2]').extract()
        item['user'] = sel.xpath('tr/td[2]/text()[1]').extract()
        item['string'] = sel.xpath('tr/td[2]/a/text()').extract()
        item['state'] = sel.xpath('tr/td[3]/b[3]/text()').extract()
        item['country'] = sel.xpath('tr/td[3]/b[1]/text()').extract()
        item['weight'] = sel.xpath('tr/td[3]/b[2]/text()').extract()
        item['position'] = sel.xpath('tr/td[4]/text()').re(r'[0-9,\-]+')
        item['old'] = sel.xpath('tr/td[5]/text()').extract()
        item['datetime'] = sel.xpath('tr/td[6]/text()').re('[0-9]{2}.[0-9]{2}.[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')
        yield item
Greetings
P.Halmsich
You want to insert things into MySQL. This means that your fields shouldn't be lists (e.g. ['my-value']) but scalars (e.g. 'my-value'). The easiest way to do this is to use extract_first() instead of extract().
extract_first() allows you to set default values like this: .extract_first(default='my-default-value') or just .extract_first('my-default-value')
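For example, applied to the country field from your spider (with 'USA' as the default value you mentioned):

item['country'] = sel.xpath('tr/td[3]/b[1]/text()').extract_first(default='USA')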
Cheers
You can always check the scraped item for empty results using an if-else statement. Try the code below:
def parse(self, response):
    for sel in response.xpath('//div[@class="pagecontainer"]'):
        item = EbayItem()
        item['id'] = sel.xpath('div[2]/text()[2]').extract()
        item['user'] = sel.xpath('tr/td[2]/text()[1]').extract()
        item['string'] = sel.xpath('tr/td[2]/a/text()').extract()
        item['state'] = sel.xpath('tr/td[3]/b[3]/text()').extract()
        item['country'] = sel.xpath('tr/td[3]/b[1]/text()').extract()
        if item['country'] == []:
            item['country'] = 'USA'
        item['weight'] = sel.xpath('tr/td[3]/b[2]/text()').extract()
        item['position'] = sel.xpath('tr/td[4]/text()').re(r'[0-9,\-]+')
        item['old'] = sel.xpath('tr/td[5]/text()').extract()
        item['datetime'] = sel.xpath('tr/td[6]/text()').re('[0-9]{2}.[0-9]{2}.[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')
        yield item
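As a style note, since an empty extract() result is falsy, the same check can also be written without comparing to []:

if not item['country']:
    item['country'] = 'USA'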
I have some Scrapy code that works in the shell, but when I try to export to CSV it returns an empty file. It exports data when I do not follow a link to parse the description, but once I add the extra method for parsing the contents, it fails to work. Here is the code:
class MonsterSpider(CrawlSpider):
    name = "monster"
    allowed_domains = ["jobs.monster.com"]
    base_url = "http://jobs.monster.com/v-technology.aspx?"
    start_urls = [
        "http://jobs.monster.com/v-technology.aspx"
    ]
    for i in range(1, 5):
        start_urls.append(base_url + "page=" + str(i))

    rules = (Rule(SgmlLinkExtractor(allow=("jobs.monster.com",)),
                  callback='parse_items'),)

    def parse_items(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="col-xs-12"]')
        #items = []
        for site in sites.xpath('.//article[@class="js_result_row"]'):
            item = MonsterItem()
            item['title'] = site.xpath('.//span[@itemprop="title"]/text()').extract()
            item['company'] = site.xpath('.//span[@itemprop="name"]/text()').extract()
            item['city'] = site.xpath('.//span[@itemprop="addressLocality"]/text()').extract()
            item['state'] = site.xpath('.//span[@itemprop="addressRegion"]/text()').extract()
            item['link'] = site.xpath('.//a[@data-m_impr_a_placement_id="jsr"]/@href').extract()
            follow = ''.join(item["link"])
            request = Request(follow, callback=self.parse_dir_contents)
            request.meta["item"] = item
            yield request
            #items.append(item)
        #return items

    def parse_dir_contents(self, response):
        item = response.meta["item"]
        item['desc'] = site.xpath('.//div[@itemprop="description"]/text()').extract()
        return item
The original code had parse_dir_contents taken out and the empty "items" list and the "append" code uncommented.
Well, as @tayfun suggests, you should use response.xpath or define the site variable.
By the way, you do not need to use sel = Selector(response). Responses come with an xpath method, so there is no need to wrap them in another selector.
However, the main problem is that you restrict the domain of the spider. You define allowed_domains = ["jobs.monster.com"], but if you look at the URLs followed by your custom Request you can see that they are something like http://jobview.monster.com/ or http://job-openings.monster.com. In this case your parse_dir_contents is not executed (the domain is not allowed) and your item does not get returned, so you won't get any results.
Change allowed_domains = ["jobs.monster.com"] to
allowed_domains = ["monster.com"]
and you will be fine and your app will work and return items.
You have an error in your parse_dir_contents method:
def parse_dir_contents(self, response):
    item = response.meta["item"]
    item['desc'] = response.xpath('.//div[@itemprop="description"]/text()').extract()
    return item
Note the use of response. I don't know where the site variable you are currently using came from.
Also, try to provide the error details when you post a question. Writing "it fails to work" doesn't say much.
So I have scrapy working really well. It's grabbing data out of a page, but the problem I'm running into is that sometimes the page's table order is different. For example, the first page it gets to:
Row name Data
Name 1 data 1
Name 2 data 2
The next page it crawls to might have a completely different order. Where Name 1 was the first row on one page, on another page it might be the 3rd or 4th, etc. The row names are always the same. I was thinking of doing this in possibly 1 of 2 different ways; I'm not sure which will work or which is better.
First option, use some if statements to find the row I need, and then grab the following column. This seems a little messy but could work.
Second option, grab all the data in the table regardless of order and put it in a dict. This way, I can grab the data I need based on row name. This seems like the cleanest approach.
Is there a 3rd option or a better way of doing either?
Here's my code in case it's helpful.
class pageSpider(Spider):
    name = "pageSpider"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://domain.com/stuffs/results",
    ]
    visitedURLs = Set()

    def parse(self, response):
        products = Selector(response).xpath('//*[@class="itemCell"]')
        for product in products:
            item = PageScraper()
            item['url'] = product.xpath('div[2]/div/a/@href').extract()[0]
            urls = Set([product.xpath('div[2]/div/a/@href').extract()[0]])
            print urls
            for url in urls:
                if url not in self.visitedURLs:
                    request = Request(url, callback=self.productpage)
                    request.meta['item'] = item
                    yield request

    def productpage(self, response):
        specs = Selector(response).xpath('//*[@id="Specs"]')
        item = response.meta['item']
        for spec in specs:
            item['make'] = spec.xpath('fieldset[1]/dl[1]/dd/text()').extract()[0].encode('utf-8', 'ignore')
            item['model'] = spec.xpath('fieldset[1]/dl[4]/dd/text()').extract()[0].encode('utf-8', 'ignore')
            item['price'] = spec.xpath('fieldset[2]/dl/dd/text()').extract()[0].encode('utf-8', 'ignore')
            yield item
The xpaths in productpage can contain data that doesn't correspond to what I need, because the order changed.
Edit:
I'm trying the dict approach and I think this is the best option.
def productpage(self, response):
    specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
    itemdict = {}
    for i in specs:
        test = i.xpath('dl')
        for t in test:
            itemdict[t.xpath('dt/text()').extract()[0]] = t.xpath('dd/text()').extract()[0]
    item = response.meta['item']
    item['make'] = itemdict['Brand']
    yield item
This seems like the best and cleanest approach (using dict)
def productpage(self, response):
    specs = Selector(response).xpath('//*[@id="Specs"]/fieldset')
    itemdict = {}
    for i in specs:
        test = i.xpath('dl')
        for t in test:
            itemdict[t.xpath('dt/text()').extract()[0]] = t.xpath('dd/text()').extract()[0]
    item = response.meta['item']
    item['make'] = itemdict['Brand']
    yield item
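One small robustness note on the dict approach: if a page happens to be missing one of the row names, itemdict['Brand'] would raise a KeyError. Using dict.get with a default keeps the spider from failing on such pages, for example:

item['make'] = itemdict.get('Brand', '')  # empty string if the 'Brand' row is absent on this page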
I have a list of keywords stored in a CSV which need to be searched for on different domains to scrape information.
I've tried to instruct my spider to follow the domains in the order: example1.com, example2.com, example3.com, and example4.com.
The spider should only move on to the next domain if it does not find a match on the previous one. If a match for the keyword is found on any of those domains, the next keyword from my CSV is picked and the search restarts from example1.com.
Importantly, I also require the particular keyword picked from the CSV to be stored in one of the item's fields.
So far my code is:
item = ExampleItem()

f = open("InputKeywords.csv")
csv_file = csv.reader(f)
productname_list = []
for row in csv_file:
    productname_list.append(row[1])

class MySpider(CrawlSpider):
    name = "test1"
    allowed_domains = ["example1.com", "example2.com", "example3.com"]

    def start_requests(self):
        for keyword in productname_list:
            item["Product_Name"] = keyword  # loading the searched keyword into my output
            request = Request("http://www.example1.com/search?noOfResults=20&keyword=" + str(keyword), self.Example1)
            yield request
            if item["Example1"] == "No Image Found on Example1":
                request = Request("www.Example2.in/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=" + str(keyword), self.Example2)
                yield request
            if item["Example2"] == "No Image Found on Example3":
                request = Request("http://www.Example3.com/search?noOfResults=20&keyword=" + str(keyword), self.Example3)
                yield request

    def Example1(self, response):
        sel = Selector(response)
        result = response.xpath("//div[@class='hoverProductImage product-image '][1]/a/@href")  # checking if the search term exists on the domain
        if result:
            request = Request(result.extract()[0], callback=self.Example1Items)  # for parsing information if the search keyword is found
            request.meta["item"] = item
            return request
        else:
            item["Example1"] = "No Image Found at Example1"
            return item

    def Example1Items(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item["Example1"] = sel.xpath("//meta[@name='og_image']/@content").extract()
        return item

    def Example2(self, response):
        sel = Selector(response)
        result = response.xpath("//div[@class='a-row a-spacing-small'][1]/a/@href")
        if result:
            request = Request(result.extract()[0], callback=self.Example2Items)
            request.meta["item"] = item
            return request
        else:
            item["Example2"] = "No Image Found at Example2"
            return item

    def Example2Items(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item["Example2"] = sel.xpath("//div[@class='ap_content']/div/div/div/img/@src").extract()
        return item

    # ---- CODE FOR EXAMPLE 3 and EXAMPLE 3 Items ----
My code is far from correct, but the first problem I'm facing is that my keywords are not being stored in the same order as in the input CSV.
I am also not able to execute the logic for the example2 or example3 searches based on the not-found condition.
Any help would be appreciated.
Basically, I need the output I'll be storing in a CSV to look like this:
{
"Keyword1", "Example1Found","","",
"Keyword2", "No Image Found at Example1","No Image Found at Example2","Example3Found",
"Keyword3", "No Image Found at Example1","Example2Found","",
}
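As an illustration of one way to keep each keyword tied to its own request (rather than sharing a single module-level item across all requests), the keyword can travel in request.meta. This is only a minimal sketch under that assumption, not a full solution to the domain-fallback logic:

def start_requests(self):
    for keyword in productname_list:
        request = Request("http://www.example1.com/search?noOfResults=20&keyword=" + str(keyword),
                          callback=self.Example1)
        # carry the keyword with this particular request
        request.meta["keyword"] = keyword
        yield request

def Example1(self, response):
    keyword = response.meta["keyword"]
    item = ExampleItem()
    item["Product_Name"] = keyword
    # ... continue as before, passing both the item and the keyword along
    #     in meta for any follow-up requests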
I've built a crawler using Scrapy that crawls a sitemap and scrapes required components from all the links in the sitemap.
class MySpider(SitemapSpider):
    name = "functie"
    allowed_domains = ["xyz.nl"]
    sitemap_urls = ["http://www.xyz.nl/sitemap.xml"]

    def parse(self, response):
        item = MyItem()
        sel = Selector(response)
        item['url'] = response.url
        item['h1'] = sel.xpath("//h1[@class='no-bd']/text()").extract()
        item['jobtype'] = sel.xpath('//input[@name=".Keyword"]/@value').extract()
        item['count'] = sel.xpath('//input[@name="Count"]/@value').extract()
        item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
        yield item
item['location'] can have empty values in some cases. In that particular case I want to scrape another component and store it in item['location'].
The code I've tried is:
item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
if not item['location']:
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()
But it doesn't check the if-condition and returns empty when the value in the location input field is empty. Any help would be highly appreciated.
You may wish to check the length of item['location'] instead.
item['location'] = sel.xpath('//input[@name="Location"]/@value').extract()
if len(item['location']) < 1:
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()
Regardless, have you considered combining the two xpaths with a |?
item['location'] = sel.xpath('//input[@name="Location"]/@value | //a[@class="location"]/text()').extract()
Try this approach:
if item['location'] == "":
    item['location'] = sel.xpath('//a[@class="location"]/text()').extract()
I think what you are trying to achieve is best solved with a custom item pipeline.
1) Open pipelines.py and check your desired if condition within a Pipeline class:
class LocPipeline(object):
    def process_item(self, item, spider):
        # check if key "location" is in the item dict
        if not item.get("location"):
            # if not, try the more specific xpath
            # (note: the pipeline has no response/selector of its own, so the
            #  selector would have to be made available here, e.g. passed along by the spider)
            item['location'] = sel.xpath('//a[@class="location"]/text()').extract()
        else:
            # if location was already found, do nothing
            pass
        return item
2) The next step is to add the custom LocPipeline() to your settings.py file:
ITEM_PIPELINES = {'myproject.pipelines.LocPipeline': 300}
With the custom pipeline added to your settings, Scrapy will automatically call LocPipeline().process_item() after MySpider().parse() and try the alternative XPath if no location has been found yet.
I have an item which is filled in across several parse functions. I want to return the updated item after parsing is complete. Here is my scenario:
My Item class:
class MyItem(Item):
    name = Field()
    links1 = Field()
    links2 = Field()
I have multiple urls to crawl after login:
In the parse function, I'm doing:
for url in urls:
    yield Request(url=url, callback=self.get_info)
In get_info, I will be extracting 'name' and 'links' in each response:
item = MyItem()
item['name'] = hxs.select("//title/text()").extract()
links = []
for data in json_parsed_from_response:
    link = {}  # a new dict for every entry, otherwise every element would reference the same dict
    link['name'] = data.get('name')
    link['url'] = data.get('url')
    links.append(link)
item['links1'] = links
# similarly, item['links2'] is created.
Now, I want to go through each of the URLs in item['links1'] and item['links2'] as follows (these loops are inside get_info):
for link in item['links1']:
    request = Request(url=link['url'], callback=self.get_status)
    request.meta['link'] = link
    yield request

for link in item['links2']:
    request = Request(url=link['url'], callback=self.get_status)
    request.meta['link'] = link
    yield request

# Where do I return item? I can't return item inside a generator.

def get_status(self, response):
    link = response.meta['link']
    if "good" in response.body:
        link['status'] = 'good'
    else:
        link['status'] = 'bad'
    # Will changes made here be reflected in the item?
    # Also, I can't return item from here. Multiple items would be returned.
I can't figure out where the item has to be returned from so that it contains all the updated data.
Sorry, but unless you give out some more details I can't understand the design of your code and therefore I can't help... The best suggestion I have is to create a list of *MyItem*s and append each item you create to that list. The values should change as you change them. So you should be able to iterate over the list and see the updated items.
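Another pattern that is sometimes used for this kind of fan-out (only a sketch with hypothetical names, assuming the item should be emitted once every link request has come back) is to share a mutable counter between the link requests and yield the item from get_status when the counter reaches zero:

def get_info(self, response):
    item = MyItem()
    item['name'] = response.xpath("//title/text()").extract()
    # ... build item['links1'] and item['links2'] as before ...
    pending = {'count': len(item['links1']) + len(item['links2'])}  # shared counter
    for link in item['links1'] + item['links2']:
        request = Request(url=link['url'], callback=self.get_status)
        request.meta['link'] = link        # the same dict object the item references
        request.meta['item'] = item
        request.meta['pending'] = pending
        yield request

def get_status(self, response):
    link = response.meta['link']
    link['status'] = 'good' if 'good' in response.body else 'bad'
    pending = response.meta['pending']
    pending['count'] -= 1
    if pending['count'] == 0:
        # the last link request has returned; the item now carries every status
        yield response.meta['item']

One caveat: requests dropped by the duplicate filter never reach get_status, so the counter would not reach zero in that case; dont_filter=True or extra bookkeeping would be needed.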