Simple way to change scrapy .getall() delimiter - python

I'm running a basic scrapy crawler and I can't seem to find any documentation within scrapy that allows me to change the delimiter of a .getall(). The default appears to be comma separated, but I'm assuming this might cause some errors in data importing elsewhere.
Ideally, I want the exported CSV to stay comma-separated, but the getall() data inside it to be pipe- or semicolon-separated. I would prefer to fix this efficiently within the Scrapy script. For example, say the bit containing the .getall() is
def entry_parse(self, response):
    for entry in response.xpath("//tbody[@class='entry-grid-body infinite']//td[@class]"):
        yield {'entry_labels': entry.xpath(".//div[@class='entry-labels']/span/text()").getall()}
It would be nice to be able to pass such an argument into getall() or something similar, but I can't seem to find any documentation allowing that. Any ideas would be helpful! Thanks.

This is not really a Scrapy problem. The .getall() method returns a list, and the repr of a list separates its items with commas by default:
>>> repr(["a","b"])
"['a', 'b']"
You can use json.dumps with its separators argument to change the delimiter before yielding the item:
import json
def entry_parse(self, response):
    for entry in response.xpath("//tbody[@class='entry-grid-body infinite']//td[@class]"):
        yield {
            'entry_labels': json.dumps(
                entry.xpath(".//div[@class='entry-labels']/span/text()").getall(),
                separators=("|", ":")
            )
        }
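If a plain delimited string is preferred over a JSON-style one, str.join is another option. The following is a minimal sketch that compares the two; the label values are made up and only stand in for whatever .getall() returns.

```python
import json

# Hypothetical values standing in for the list returned by .getall().
labels = ["New", "On sale", "Limited"]

# Plain pipe-delimited string, with no brackets or quotes.
pipe_joined = "|".join(labels)

# JSON-style string; the first separator is placed between list items.
json_encoded = json.dumps(labels, separators=("|", ":"))

print(pipe_joined)
print(json_encoded)
```

The joined form is easier to split again with str.split, while the JSON form keeps quoting, which matters if a label could itself contain the delimiter.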

Related

regex that access json data from javascript html tag with scrapy

I'm new to Scrapy and still learning. I'm trying to access JSON data embedded in a page's HTML and load it into a Python dict to work with later. I tried several things and all of them failed, so I would appreciate any help.
I found the response.css to the desired tag which result looks like this in scrapy shell:
response.css('div.rich-snippet script').get()
'<script type="application/ld+json">{\n some json data with newline chars \n }\n ]\n}</script>'
I need everything between {} but, so I tried regex to do it, like this:
response.css('div.rich-snippet script').re(r'\{[^}]*\}')
This regex should pick everything between the braces, but the JSON contains more of these symbols and there is other content in the response before the JSON data, so it just returns an empty list.
I tried other patterns but always got the same result, an empty list:
.re(r'<script>\{[^}]*\}</script>')
.re(r'<script>(.|\n)*?<\/script>')
...
So I tried something else: inside the spider I passed the response directly to json.loads and saved the results to a file from the CLI. That doesn't work either; perhaps I'm selecting the tag wrong, or it's not even possible.
import scrapy
import json
class SomeSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'url'
    ]
    def parse(self, response, **kwargs):
        json_file = response.css('div.rich-snippet script').get()
        yield json.loads(json_file)
yet again, an empty result
Pls help me to understand, thanks.
Your CSS selector should specify that you only want the text inside the tag, that is, it should end in ::text. You also need .get() to turn the selection into a string before passing it to json.loads, so your code becomes:
def parse(self, response, **kwargs):
    json_file = response.css('div.rich-snippet script::text').get()
    yield json.loads(json_file)
You might also want to have a look at:
https://github.com/scrapinghub/extruct
It might better fit parsing ld+json
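Once only the text inside the script tag is selected, the string is plain JSON, and the embedded newlines are not a problem for json.loads. A minimal sketch, assuming a made-up ld+json payload in place of the real page content:

```python
import json

# Hypothetical text content of the <script type="application/ld+json"> tag.
script_text = '{\n "@type": "Product",\n "name": "Example",\n "offers": [\n  {"price": "9.99"}\n ]\n}'

# json.loads ignores whitespace, including newlines, between tokens.
data = json.loads(script_text)

print(data["name"])
print(data["offers"][0]["price"])
```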
You could take the response as a string and use a recursive regex on it. Recursion is not supported by the built-in re module, but it is supported by the newer third-party regex module.
That said, a possible approach could be:
import regex
# code before
some_json_string = response.css('div.rich-snippet script').get()
match = regex.search(r'\{(?:[^{}]*|(?R))+\}', some_json_string)
if match:
    relevant_json = match.group(0)
    # process it further here
See a demo on regex101.com for the expression.
Edit:
It seems that ::text is supported, so better use this answer instead.
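If the JSON has to be pulled out of a larger string and the third-party regex module is not available, the standard library's json.JSONDecoder.raw_decode is another option. This is a different technique from the recursive regex above, shown only as a sketch; the helper name first_json_object and the sample markup are made up for illustration.

```python
import json

def first_json_object(text):
    """Return the first JSON object found in text, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None

# Made-up markup standing in for the result of .get() on the script tag.
markup = '<script type="application/ld+json">{\n "name": "Example", "offers": {"price": "9.99"}\n}</script>'
data = first_json_object(markup)
print(data)
```

Because the decoder tracks nesting itself, nested braces inside the payload do not need special handling.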

Python lxml html builder

Hey guys, I want to build an HTML file in Python. I read an XML document with Python requests and counted the elements that have a given attribute value, like this:
count = len(nodeData.xpath("//user[@condition='good']"))
print(count)
Now I want to build a table that shows this count:
nodeRow = html.TR(html.TD(count , style="background-color:#FF0000"))
nodeTable.append(nodeRow)
print(etree.tostring(nodeTable))
with open("out3.html", "wb") as f:
    f.write(etree.tostring(nodeTable))
But that doesn't work. The error is
TypeError: bad argument type: int(2746)
The error message is pretty clear: you can only put strings into the text content of an element. As you have an int, lxml balks. Convert it to a string first:
nodeRow = html.TR(html.TD(str(count) , style="background-color:#FF0000"))
You should consider using a template library though, it will make doing this much easier, as it takes care of these little obstacles, and allows a more natural writing of longer HTML snippets.

Cleaning data scraped using Scrapy

I have recently started using Scrapy and am trying to clean some data I have scraped and want to export to CSV, namely the following three examples:
Example 1 – removing certain text
Example 2 – removing/replacing unwanted characters
Example 3 –splitting comma separated text
Example 1 data looks like:
Text I want,Text I don’t want
Using the following code:
'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract()
Example 2 data looks like:
 - but I want to change this to £
Using the following code:
' Scraped 2': response.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract()
Example 3 data looks like:
Item 1,Item 2,Item 3,Item 4,Item 4,Item5 – ultimately I want to split
this into separate columns in a CSV file
Using the following code:
' Scraped 3': response.xpath('//div/div/div/ul/li/p/text()').extract()
I have tried using str.replace(), but can’t seem to get that to work, e.g:
'Scraped 1': response.xpath('//div/div/div/h1/span/text()').extract((str.replace(",Text I don't want",""))
I am looking into this but would appreciate it if anyone could point me in the right direction!
Code below:
import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import Product
class QuotesSpider(scrapy.Spider):
    name = "quotes_product"
    start_urls = [
        'http://www.unitestudents.com/',
    ]
    # Step 1
    def parse(self, response):
        for city in response.xpath('//select[@id="frm_homeSelect_city"]/option[not(contains(text(),"Select your city"))]/text()').extract(): # Select all cities listed in the select (exclude the "Select your city" option)
            yield scrapy.Request(response.urljoin("/"+city), callback=self.parse_citypage)
    # Step 2
    def parse_citypage(self, response):
        for url in response.xpath('//div[@class="property-header"]/h3/span/a/@href').extract(): #Select for each property the url
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)
    # Step 3
    def parse_unitpage(self, response):
        for final in response.xpath('//div/div/div[@class="content__btn"]/a/@href').extract(): #Select final page for data scrape
            yield scrapy.Request(response.urljoin(final), callback=self.parse_final)
    #Step 4
    def parse_final(self, response):
        unitTypes = response.xpath('//html/body/div').extract()
        for unitType in unitTypes: # There can be multiple unit types so we yield an item for each unit type we can find.
            l = ItemLoader(item=Product(), response=response)
            l.add_xpath('area_name', '//div/ul/li/a/span/text()')
            l.add_xpath('type', '//div/div/div/h1/span/text()')
            l.add_xpath('period', '/html/body/div/div/section/div/form/h4/span/text()')
            l.add_xpath('duration_weekly', '//html/body/div/div/section/div/form/div/div/em/text()')
            l.add_xpath('guide_total', '//html/body/div/div/section/div/form/div/div/p/text()')
            l.add_xpath('amenities','//div/div/div/ul/li/p/text()')
            return l.load_item()
However, I'm getting the following?
value = self.item.fields[field_name].get(key, default)
KeyError: 'type'
You have the right idea with str.replace, although I would suggest the Python 're' regular expressions library as it is more powerful. The documentation is top notch and you can find some useful code samples there.
I am not familiar with the scrapy library, but it looks like .extract() returns a list of strings. If you want to transform these using str.replace or one of the regex functions, you will need to use a list comprehension:
'Selector 1': [ x.replace('A', 'B') for x in response.xpath('...').extract() ]
Edit: Regarding the separate columns-- if the data is already comma-separated just write it directly to a file! If you want to split the comma-separated data to do some transformations, you can use str.split like this:
"A,B,C".split(",") # returns [ "A", "B", "C" ]
In this case, the data returned from .extract() will be a list of comma-separated strings. If you use a list comprehension as above, you will end up with a list-of-lists.
If you want something more sophisticated than splitting on each comma, you can use python's csv library.
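The list-comprehension and splitting ideas above can be sketched together as follows. The input lists are made-up stand-ins for what .extract() returns, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Hypothetical values standing in for the lists returned by .extract().
scraped_type = ["Text I want,Text I don't want"]
scraped_amenities = ["Item 1,Item 2,Item 3"]

# Example 1: keep only the part before the first comma of each string.
types = [value.split(",", 1)[0] for value in scraped_type]

# Example 3: split each comma-separated string into a list of fields.
amenities = [value.split(",") for value in scraped_amenities]

# Write one CSV row per scraped string; the csv module handles quoting.
buffer = io.StringIO()
writer = csv.writer(buffer)
for type_, fields in zip(types, amenities):
    writer.writerow([type_, *fields])

print(buffer.getvalue())
```

Splitting before writing puts each item in its own column, whereas writing the unsplit string would make the csv module quote it as a single field.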
It would be much easier to provide a more specific answer if you had provided your spider and item definitions. Here are some generic guidelines.
If you want to keep things modular and follow Scrapy's suggested project architecture and separation of concerns, you should clean and prepare your data for export via Item Loaders with input and output processors.
For the first two examples, MapCompose looks like a good fit.
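To illustrate the idea without requiring Scrapy, here is a plain-Python stand-in that mimics the behaviour of a MapCompose-style processor: each function is applied to every value in turn, None results are dropped, and list results are flattened. The names map_compose and keep_before_comma are hypothetical, and this is not Scrapy's implementation.

```python
def map_compose(*functions):
    """Plain-Python stand-in for a MapCompose-style processor (not Scrapy's class)."""
    def process(values):
        for function in functions:
            next_values = []
            for value in values:
                result = function(value)
                if result is None:
                    continue  # dropped values are not passed on
                if isinstance(result, (list, tuple)):
                    next_values.extend(result)  # flatten list results
                else:
                    next_values.append(result)
            values = next_values
        return values
    return process

def keep_before_comma(value):
    # Hypothetical cleaner for Example 1: drop everything after the first comma.
    return value.split(",", 1)[0]

clean_type = map_compose(str.strip, keep_before_comma)
result = clean_type(["  Text I want,Text I don't want "])
print(result)
```

In a Scrapy project the same chain would be passed to MapCompose (importable from itemloaders.processors in recent releases) and attached to the field as its input processor.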

How to do parsing in python?

I'm kinda new to Python, and I'm trying to find out how to do parsing in Python.
I've got a task: parse some text made up of symbols that are unknown to me and put the result into a DB. I guess I can create the DB and tables with the help of SQLAlchemy, but I have no idea how to do the parsing or what all the symbols below mean.
http://joxi.ru/YmEVXg6Iq3Q426
http://joxi.ru/E2pvG3NFxYgKrY
$$HDRPUBID 112701130020011127162536
H11127011300UNIQUEPONUMBER120011127
D11127011300UNIQUEPONUMBER100001112345678900000001
D21127011300UNIQUEPONUMBER1000011123456789AR000000001
D11127011300UNIQUEPONUMBER200002123456987X000000001
D21127011300UNIQUEPONUMBER200002123456987XIR000000000This item is inactive. 9781605600000
$$EOFPUBID 1127011300200111271625360000005
Thanks in advance to anyone who can give me some advice on where to start and how the parsing works.
The best approach is to first figure out where each token begins and ends, and write a regular expression to capture these. The site RegexPal might help you design the regex.
As others suggest, take a look at some regex tutorials, and also at the help for the re module.
You are probably looking for something like this:
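import re
# Field name -> (start, end) as 1-based, inclusive character positions in the header line.
headerMapping = {'type': (1,5), 'pubid': (6,11), 'batchID': (12,21),
                 'batchDate': (22,29), 'batchTime': (30,35)}
# text is assumed to hold the whole data file as one string.
poaBatchHeaders = re.findall(r'\$\$HDR\d{30}', text)
parsedBatchHeaders = []
for poaHeader in poaBatchHeaders:
    batchHeaderDict = {}  # a fresh dict per header, so entries do not overwrite each other
    for key in headerMapping:
        start = headerMapping[key][0]-1
        end = headerMapping[key][1]
        batchHeaderDict.update({key: poaHeader[start:end]})
    parsedBatchHeaders.append(batchHeaderDict)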
Then you have a list of dicts, where each dict holds the attributes of one matched structure (the POA batch header in this example). I assume that your data file is in text, which is a string.
If you want to parse it further, you have to write a function to parse the data in each attribute.
def batchDate(batch):
    return (batch[0:2]+'-'+batch[2:4]+'-20'+batch[4:])
for header in parsedBatchHeaders:
    header.update({'batchDate': batchDate( header['batchDate'] )})
Remember, that's an example and I don't know documentation of your data! I guess it works like that, but rest is up to you.
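As a worked illustration of slicing by position, the sketch below applies the same column mapping to the sample header line from the question. It assumes that the date field is YYYYMMDD and the time field is HHMMSS, which matches the sample but is not confirmed by any format documentation; the helper name parse_fixed_width is made up.

```python
from datetime import datetime

# Field name -> (start, end) as 1-based, inclusive column positions,
# copied from the mapping in the answer above.
HEADER_FIELDS = {
    "type": (1, 5),
    "pubid": (6, 11),
    "batchID": (12, 21),
    "batchDate": (22, 29),
    "batchTime": (30, 35),
}

def parse_fixed_width(line, fields):
    """Slice a fixed-width line into a dict of stripped field values (hypothetical helper)."""
    return {name: line[start - 1:end].strip() for name, (start, end) in fields.items()}

# Sample header line from the question.
sample = "$$HDRPUBID 112701130020011127162536"
header = parse_fixed_width(sample, HEADER_FIELDS)

# Assumption: date is YYYYMMDD and time is HHMMSS.
stamp = datetime.strptime(header["batchDate"] + header["batchTime"], "%Y%m%d%H%M%S")

print(header)
print(stamp.isoformat())
```

Note that in the sample the publisher ID slot holds the placeholder PUBID rather than digits, so a digits-only pattern would not match that exact line, while slicing by position does not depend on the field contents.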

XML to store system paths in Python with lxml

I'm using an XML file to store the configuration for a piece of software.
One of these settings would be a system path like
> set_value = "c:\\test\\3 tests\\test"
I can store it by using:
> setting = etree.SubElement(settings, "setting", name=tmp_set_name, type=set_type, value=set_value)
If I use
doc.write(output_file, method='xml',encoding = 'utf-8', compression=0)
the file would be:
<setting type="str" name="MyPath" value="c:\test\3 tests\test"/>
Now I read it again with the etree.parse method. I obtain an etree child object with a string value, but the string contains the \3 character, and if I try to use it to write to XML again it will be interpreted !!!!! So I cannot use it as a path anymore.
Maybe I'm only missing a simple string operation, but I cannot see it =)
How would you solve it in a smart way?
This is just an example, but what do you think is the best way to store paths in XML and parse them with lxml?
Thank you !!
> Now I read it again with the etree.parse method. I obtain an etree child object with a string value, but the string contains the \3 character and if i try to use it to write again to xml it will be interpreted !!!!!
I just tried that, and it doesn't get "interpreted". The element's attributes, as returned after parsing, are:
{'type': 'str', 'name': 'yowza!', 'value': 'c:\\test\\3 tests\\test'}
So, as you can see, this works just as you expected it to. If you really have this problem, you are doing something other than what you describe. Show us the real code, or write a small example that demonstrates the problem.
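A round trip like the one described can be checked with a short script. This sketch uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; the element and attribute names follow the question, and the in-memory serialization stands in for writing and re-reading the file.

```python
import xml.etree.ElementTree as ET

# Path value from the question; the doubled backslashes are only Python
# string-literal escaping, so the stored value has single backslashes.
set_value = "c:\\test\\3 tests\\test"

settings = ET.Element("settings")
ET.SubElement(settings, "setting", name="MyPath", type="str", value=set_value)

# Serialize to bytes and parse again, as a stand-in for the file round trip.
xml_bytes = ET.tostring(settings, encoding="utf-8")
reloaded = ET.fromstring(xml_bytes)
round_tripped = reloaded.find("setting").get("value")

print(round_tripped)        # the path as stored
print(repr(round_tripped))  # repr doubles the backslashes for display only
```

The doubled backslashes seen in a dict or repr output are a display convention, so the value read back is the same string that was stored.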
