Cleaning data scraped using Scrapy - python

I have recently started using Scrapy and am trying to clean some data I have scraped and want to export to CSV, namely the following three examples:
Example 1 – removing certain text
Example 2 – removing/replacing unwanted characters
Example 3 – splitting comma-separated text
Example 1 data looks like:
Text I want,Text I don’t want
Using the following code:
'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract()
Example 2 data looks like:
 - but I want to change this to £
Using the following code:
' Scraped 2': response.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract()
Example 3 data looks like:
Item 1,Item 2,Item 3,Item 4,Item 4,Item 5 – ultimately I want to split
this into separate columns in a CSV file
Using the following code:
' Scraped 3': response.xpath('//div/div/div/ul/li/p/text()').extract()
I have tried using str.replace(), but can’t seem to get it to work, e.g.:
'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract((str.replace(",Text I don't want",""))
I am still looking into this, but would appreciate it if anyone could point me in the right direction!
Code below:
import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import Product

class QuotesSpider(scrapy.Spider):
    name = "quotes_product"
    start_urls = [
        'http://www.unitestudents.com/',
    ]

    # Step 1
    def parse(self, response):
        # Select all cities listed in the select (exclude the "Select your city" option)
        for city in response.xpath('//select[@id="frm_homeSelect_city"]/option[not(contains(text(),"Select your city"))]/text()').extract():
            yield scrapy.Request(response.urljoin("/" + city), callback=self.parse_citypage)

    # Step 2
    def parse_citypage(self, response):
        # Select the url of each property
        for url in response.xpath('//div[@class="property-header"]/h3/span/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)

    # Step 3
    def parse_unitpage(self, response):
        # Select the final page for the data scrape
        for final in response.xpath('//div/div/div[@class="content__btn"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(final), callback=self.parse_final)

    # Step 4
    def parse_final(self, response):
        # There can be multiple unit types, so we yield an item for each unit type we can find.
        unitTypes = response.xpath('//html/body/div').extract()
        for unitType in unitTypes:
            l = ItemLoader(item=Product(), response=response)
            l.add_xpath('area_name', '//div/ul/li/a/span/text()')
            l.add_xpath('type', '//div/div/div/h1/span/text()')
            l.add_xpath('period', '/html/body/div/div/section/div/form/h4/span/text()')
            l.add_xpath('duration_weekly', '//html/body/div/div/section/div/form/div/div/em/text()')
            l.add_xpath('guide_total', '//html/body/div/div/section/div/form/div/div/p/text()')
            l.add_xpath('amenities', '//div/div/div/ul/li/p/text()')
            return l.load_item()
However, I'm getting the following error:
value = self.item.fields[field_name].get(key, default)
KeyError: 'type'

You have the right idea with str.replace, although I would suggest the Python 're' regular expressions library as it is more powerful. The documentation is top notch and you can find some useful code samples there.
I am not familiar with the scrapy library, but it looks like .extract() returns a list of strings. If you want to transform these using str.replace or one of the regex functions, you will need to use a list comprehension:
'Selector 1': [ x.replace('A', 'B') for x in response.xpath('...').extract() ]
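For example, the same cleanup with re.sub might look like this (a minimal sketch; the pattern is a stand-in for whatever trailing text you actually need to strip):

import re

cleaned = [re.sub(r",Text I don't want$", "", x)
           for x in response.xpath('//div/div/div/h1/span/text()').extract()]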
Edit: Regarding the separate columns: if the data is already comma-separated, just write it directly to a file! If you want to split the comma-separated data to do some transformations, you can use str.split like this:
"A,B,C".split(",") # returns [ "A", "B", "C" ]
In this case, the data returned from .extract() will be a list of comma-separated strings. If you use a list comprehension as above, you will end up with a list-of-lists.
If you want something more sophisticated than splitting on each comma, you can use python's csv library.
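For instance, a minimal sketch of writing the split values as separate CSV columns (the output file name is illustrative):

import csv

rows = [x.split(',') for x in response.xpath('//div/div/div/ul/li/p/text()').extract()]

# Each inner list becomes one row, so every comma-separated item
# lands in its own column.
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)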

It would be much easier to provide a more specific answer if you had provided your spider and item definitions. Here are some generic guidelines.
If you want to keep things modular and follow Scrapy's suggested project architecture and separation of concerns, you should clean and prepare your data for export via Item Loaders with input and output processors.
For the first two examples, MapCompose looks like a good fit.
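Note that the KeyError: 'type' above means the Product item does not declare a type field; an ItemLoader can only populate fields defined on the item. With that fixed, a minimal sketch of input processors could look like this (the cleaning functions are illustrative assumptions, not the site's actual data):

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

def remove_unwanted_text(value):
    # Example 1: drop the unwanted trailing text (the pattern is illustrative).
    return value.replace(",Text I don't want", "")

def fix_currency(value):
    # Example 2: swap the unwanted character for a pound sign
    # ('?' stands in for whatever character actually appears).
    return value.replace('?', '£')

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    type_in = MapCompose(str.strip, remove_unwanted_text)
    duration_weekly_in = MapCompose(str.strip, fix_currency)

The spider's add_xpath calls stay exactly as they are; the cleaning then happens inside the loader as the values are collected.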

Related

Simple way to change scrapy .getall() delimiter

I'm running a basic scrapy crawler and I can't seem to find any documentation within scrapy that allows me to change the delimiter of a .getall(). The default appears to be comma-separated, but I'm assuming this might cause some errors when importing the data elsewhere.
Ideally, I want the exported CSV to be comma-separated but the .getall() data to be pipe- or semicolon-separated, and I would prefer to fix this within the scrapy script. For example, say the bit containing the .getall() is
def entry_parse(self, response):
    for entry in response.xpath("//tbody[@class='entry-grid-body infinite']//td[@class]"):
        yield {'entry_labels': entry.xpath(".//div[@class='entry-labels']/span/text()").getall()}
Ideally, it would be nice to be able to pass such an argument into getall() or something similar, but I can't seem to find any documentation allowing that. Any ideas would be helpful! Thanks.
This is not really a problem with scrapy. The .getall() method returns a list, and the repr of a list is comma-separated by default:
>>> repr(["a", "b"])
"['a', 'b']"
You can use json.dumps and change the delimiter before yielding the item, using the separators argument:
import json

def entry_parse(self, response):
    for entry in response.xpath("//tbody[@class='entry-grid-body infinite']//td[@class]"):
        yield {
            'entry_labels': json.dumps(
                entry.xpath(".//div[@class='entry-labels']/span/text()").getall(),
                separators=("|", ":")
            )
        }
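For a list like ["a", "b"] this produces the string '["a"|"b"]': separators only changes the item and key separators, the surrounding brackets remain. If all you need is a different joiner, plain str.join is a simpler alternative (sketch below), and if you export through Scrapy's CSV feed exporter, its CsvItemExporter also accepts a join_multivalued argument that controls how multi-valued fields are joined.

labels = entry.xpath(".//div[@class='entry-labels']/span/text()").getall()
yield {'entry_labels': "|".join(labels)}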

regex that access json data from javascript html tag with scrapy

I'm new to scrapy, learning at the moment, and I'm trying to access JSON data in a page's HTML and put it in a Python dict to work with the data later. I have tried several things, all of which failed; I would appreciate it if anyone could help me with this.
I found the response.css selector for the desired tag; the result looks like this in scrapy shell:
response.css('div.rich-snippet script').get()
'<script type="application/ld+json">{\n some json data with newline chars \n }\n ]\n}</script>'
I need everything between the outer {}, so I tried a regex, like this:
response.css('div.rich-snippet script').re(r'\{[^}]*\}')
This regex should pick up everything between brackets, but there are more of these symbols inside the JSON, and there are other things in the response before the JSON data, so this just returns an empty list.
I tried more patterns, but always with the same result: an empty list.
.re(r'<script>\{[^}]*\}</script>')
.re(r'<script>(.|\n)*?<\/script>')
...
So I tried something else: inside the spider I tried to pass the response directly to the json.loads method and save the results to a file from the CLI. That doesn't work either; perhaps I'm parsing the tag wrong, or it's not even possible.
import scrapy
import json

class SomeSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'url'
    ]

    def parse(self, response, **kwargs):
        json_file = response.css('div.rich-snippet script').get()
        yield json.loads(json_file)
Yet again, an empty result.
Please help me understand, thanks.
Your CSS selector should specify that you only want the text inside the tag, i.e. it should use ::text, so your code becomes:
def parse(self, response, **kwargs):
    json_file = response.css('div.rich-snippet script::text').get()
    yield json.loads(json_file)
You might also want to have a look at:
https://github.com/scrapinghub/extruct
It might be a better fit for parsing ld+json.
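A minimal sketch of what that could look like (assuming extruct is installed; the variable names are illustrative):

import extruct

# Parse every JSON-LD block out of the raw HTML.
data = extruct.extract(response.text, base_url=response.url, syntaxes=['json-ld'])
json_ld_items = data['json-ld']  # one dict per <script type="application/ld+json"> block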
You could take the response as a string and use a recursive regex on it. Recursion is not supported by the built-in re module, but it is by the newer regex module.
That said, a possible approach could be:
import regex

# code before
some_json_string = response.css('div.rich-snippet script').get()
match = regex.search(r'\{(?:[^{}]*|(?R))+\}', some_json_string)
if match:
    relevant_json = match.group(0)
    # process it further here
See a demo on regex101.com for the expression.
Edit: It seems that ::text is supported, so it is better to use that answer instead.

Scrape a webpage using scrapy into tab-delimited format

I would like to scrape and parse the data on these two pages: here and here into a tab-delimited format using scrapy. I did these commands:
scrapy shell
fetch("https://www.drugbank.ca/drugs/DB04899")
print response.text
My questions:
1. For example, for this page, when I type:
response.css(".sequence::text").extract()
[u'>DB04899: Natriuretic peptides B\nSPKMVQGSGCFGRKMDRISSSSGLGCKVLRRH']
But then when I type:
>>> response.css(".synonyms::text").extract()
[]
>>> response.css(".Synonyms::text").extract()
[]
But you can see that there are synonyms listed on the webpage, so the output should not be empty. Can someone demonstrate what I'm doing wrong? (I also tried other tags, such as synonym, Synonym, etc.)
2. When I type response.css(".targets::text").extract(), the output is [u'Targets (3)']. I'm wondering how I can actually parse the data within this list, but I guess this is related to not using the right tags, as in question 1 above.
3. This is a vague/advanced question for me at the minute: is it possible to just scrape the whole page in one go, instead of having to know each individual tag? So my output would be a dictionary called 'identification' with Name, accession number, type, etc. as keys; then a dictionary called 'pharmacology' with indication, structured indication, etc. as keys; then another dictionary called 'interactions'; and another called 'pharmacoeconomics'; one dictionary per page section?
Thanks
There really are no elements with a synonyms or Synonyms class attribute value on the page.
You can get to the synonyms by "going to the right" of the dt element containing the "Synonyms" text, using following-sibling:
In [2]: response.xpath("//dt[. = 'Synonyms']/following-sibling::dd/ul/li/text()").extract()
Out[2]:
['BNP',
'Brain natriuretic peptide 32',
'Natriuretic peptides B',
'Nesiritide recombinant']
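As for scraping a whole section in one go (question 3), the same following-sibling idea can be generalized: instead of knowing each tag, walk every dt/dd pair in a section and build a dictionary. A rough sketch, assuming the sections are dl lists of dt/dd pairs (the exact XPath would need adjusting to the real page structure):

identification = {}
for dt in response.xpath("//dl/dt"):
    key = dt.xpath("text()").extract_first()
    values = dt.xpath("following-sibling::dd[1]//text()").extract()
    identification[key] = " ".join(v.strip() for v in values)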

Splitting a Scrapy element among multiple CSV rows

I've been working on something that I think should be relatively easy but I keep hitting my head against a wall. I've tried multiple similar solutions from stackoverflow and I've improved my code but still stuck on the basic functionality.
I am scraping a web page that returns an element (genre) that is essentially a list of genres:
Mystery, Comedy, Horror, Drama
The xpath returns perfectly. I'm using a Scrapy pipeline to output to a CSV file. What I'd like to do is create a separate row for each item in the above list along with the page url:
"Mystery", "http:domain.com/page1.html"
"Comedy", "http:domain.com/page1.html"
No matter what I try, I can only output:
"Mystery, Comedy, Horror, Drama", "http://domain.com/page1.html"
Here's my code:
def parse_genre(self, response):
    for item in [i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]:
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', item, MapCompose(str.strip))
        yield sg.load_item()
This is called from the main parse routine for the spider. That all functions correctly. (I have two items on each web page. The main spider gathers the "parent" information and this function is attempting to gather "child" information. Technically not a child record, but definitely a 1 to many relationship.)
I've tried a number of possible solutions. This is the only version that makes sense to me and seems like it should work. I'm sure I'm just not splitting the genre string correctly.
You are very close.
Your culprit seems to be the way you are getting your items:
[i.split(',') for i in response.xpath('//span[contains(@class, "genre")]/text()').extract()]
Without the page source I can't correct you fully, but it is clear that your code is returning a list of lists.
You should either flatten this list of lists into list of strings or iterate through it appropriately:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
for item in items:
    for category in item.split(','):
        sg = ItemLoader(item=ItemGenre(), response=response)
        sg.add_value('url', response.url)
        sg.add_value('genre', category, MapCompose(str.strip))
        yield sg.load_item()
An alternative, more advanced technique is to use a nested list comprehension:
items = response.xpath('//span[contains(@class, "genre")]/text()').extract()
# good cheatsheet to remember this: [leaf for tree in forest for leaf in tree]
categories = [cat for item in items for cat in item.split(',')]
for category in categories:
    sg = ItemLoader(item=ItemGenre(), response=response)
    sg.add_value('url', response.url)
    sg.add_value('genre', category, MapCompose(str.strip))
    yield sg.load_item()

How to use loaded data to add a new value in an ItemLoader?

I have started a scraping project, and I have a small problem with ItemLoader.
Suppose I have some ItemLoader in a scraper:
l = ScraperProductLoader(item=ScraperProduct(), selector=node)
l.add_xpath('sku', 'id/text()')
I would like to add a URL to the item loader based on the sku I have provided:
l.add_value('url', '?????')
...However, based on the documentation, I don't see a clear way to do this.
Options I have considered:
Input processor: Add a string, and pass the sku as the context somehow
Handle separately: Create the URL without using the item loader
How can I use loaded data to add a new value in an ItemLoader?
You can use the get_output_value() method:
get_output_value(field_name)
Return the collected values parsed using
the output processor, for the given field. This method doesn’t
populate or modify the item at all.
l.add_value('url', 'http://domain.com/' + l.get_output_value('sku'))
