Scrapy formatting results - python

I'm just starting to get to grips with Scrapy. So far, I've figured out how to extract the relevant sections of a web page and to crawl through web pages.
However, I'm still unsure as to how one can format the results in a meaningful tabular format.
When the scraped data is in a table format, it's straightforward enough. However, sometimes the data isn't, e.g. this link
I can access the names using
response.xpath('//div[@align="center"]//h3').extract()
Then I can access the details using
response.xpath('//div[@align="center"]//p').extract()
Now, I need to format the data like this, so I can save it to a CSV file.
Name: J Speirs Farms Ltd
Herd Prefix: Pepperstock
Membership No. 7580
Dept. Herd Mark: UK244821
Membership Type: Youth
Year Joined: 2006
Address: Pepsal End Farm, Pepperstock, Luton, Beds
Postcode: LU1 4LH
Region: East Midlands
Telephone: 01582450962
Email:
Website:
Ideally, I'd like to define the structure of the data first, then populate it according to the scraped data, because in some cases certain fields are not available, e.g. Email: and Website:
I don't need the answer, but would appreciate if someone can point me in the right direction.

All of the data seem to be separated by newlines, so simply use str.splitlines():
> names = response.xpath('//div[@align="center"]//a[@name]')
> details = names[0].xpath('following-sibling::p[1]/text()').extract_first().splitlines()
['J Speirs Farms Ltd ', 'Herd Prefix: Pepperstock ', 'Membership No. 7580 ', 'Dept. Herd Mark: UK244821 ', 'Membership Type: Youth ', 'Year Joined: 2006 ', 'Address: Pepsal End Farm ', ' Pepperstock ', ' Luton ', ' Beds ', 'Postcode: LU1 4LH ', 'Region: East Midlands ', 'Telephone: 01582450962 ']
> name = names[0].xpath('@name').extract_first()
'J+Speirs+Farms+Ltd+++'
Now you just need to figure out how to parse those bits into a clean format:
Some values are split across multiple lines, but you can identify and fix the list by checking whether an entry contains : or No.; if it doesn't, it belongs to the preceding entry that does:
clean_details = [f'Name: {details[0]}']
# first item is the name, skip it in the loop
for d in details[1:]:
    if ':' in d or 'No.' in d:
        clean_details.append(d)
    else:
        clean_details[-1] += d
Finally, parse the cleaned-up details list we have:
item = {}
for detail in clean_details:
    values = detail.split(':')
    if len(values) < 2:  # e.g. Membership No.
        values = detail.split('No.')
    if len(values) == 2:  # e.g. telephone: 1337
        label, text = values
        item[label] = text.strip()
>>> pprint(item)
{'Address': 'Pepsal End Farm Pepperstock Luton Beds',
'Dept. Herd Mark': 'UK244821',
'Herd Prefix': 'Pepperstock',
'Membership ': '7580',
'Membership Type': 'Youth',
'Name': 'J Speirs Farms Ltd',
'Postcode': 'LU1 4LH',
'Region': 'East Midlands',
'Telephone': '01582450962',
'Year Joined': '2006'}
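Since the end goal is a CSV file, once each record is a flat dict like the one above you can simply yield it from your spider callback and let Scrapy's built-in feed exports write the file for you (the spider and file names below are placeholders):
scrapy crawl members_spider -o members.csv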

You can define a class for the items you want to save and import that class into your spider. Then you can save the items directly.
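A minimal sketch of what such an item class could look like; the field names come from the question, while the class name and the items.py module are just the usual conventions, not something from the original post:
# items.py
import scrapy

class MemberItem(scrapy.Item):
    name = scrapy.Field()
    herd_prefix = scrapy.Field()
    membership_no = scrapy.Field()
    dept_herd_mark = scrapy.Field()
    membership_type = scrapy.Field()
    year_joined = scrapy.Field()
    address = scrapy.Field()
    postcode = scrapy.Field()
    region = scrapy.Field()
    telephone = scrapy.Field()
    email = scrapy.Field()
    website = scrapy.Field()
Fields that are missing on a given page (e.g. Email or Website) can simply be left unset; declared but unpopulated fields should then come out as empty columns when exporting.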

Related

Extract text with multiple regex patterns in Python

I have a list with address information
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of a list in a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there is more than one pattern for "region". For example, there can be "South reg." or "South r-n".
How can I combine multiple patterns?
Also, the digit 4 in the list means the building number. There can be only digits, or something like 4k1.
How can I extract the building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of the regions which are accepted, it would be better to construct the regex based on the valid values, not first word.
Also, for the building extraction, I am not sure of which are the characters you want to keep, versus the ones which you may want to remove. In this case I chose to keep only alphanumeric, meaning that everything else would be stripped.
CODE
import re
list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']
def GetFirstWord(list2, column):
    return re.search(r'\w+', list2[column].strip()).group()
def KeepAlpha(list2, column):
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())
print(GetFirstWord(list1,0))
print(KeepAlpha(list1,2))
OUTPUT
South
4k1

Avoid duplicate values on scrapy

I'm scraping MOOC data from course talk pages, and I'm having issues cleaning some of the fields, e.g. the university name.
From the above link I want to get: Massachusetts Institute of Technology
This is the xpath I'm using for that field:
response.xpath('//*[@class="course-info__school__name"]//text()').extract()
The problem here is that I'm getting duplicated values and empty strings from it:
[u'\n ',
u'University:\xa0',
u'\n Massachusetts Institute of Technology\n ',
u'\n ',
u'University:\xa0',
u'\n Massachusetts Institute of Technology\n ']
You can skip the inner span by using the not() function (to exclude the inner child span node) and the normalize-space() function to skip whitespace-only text strings and clean the text:
//*[@class="course-info__school__name"]/text()[not(self::span)][normalize-space()]
As a result you should get two equal strings with the university name only:
[u'Massachusetts Institute of Technology',
u'Massachusetts Institute of Technology']
And you can use python set to get unique names only:
>>> l = [u'Massachusetts Institute of Technology',
... u'Massachusetts Institute of Technology']
>>> set(l)
set([u'Massachusetts Institute of Technology'])
If you need contents of first div only, you can get it by index 1 with just xpath:
(//*[@class="course-info__school__name"])[1]/text()[not(self::span)][normalize-space()]
The reason lies in the fact that there are two divs with class name course-info__school__name.
Therefore, to avoid duplicates, you could change the xpath so that it only selects the first div element with class name course-info__school__name:
response.xpath('(//div[@class="course-info__school__name"])[1]//text()').extract()
which will give you the result of
['\n ',
'University:\xa0',
'\n Massachusetts Institute of Technology\n ']
Hope it helps!
You can try this way to get unique values always.
set(response.xpath('//*[@class="course-info__school__name"]//text()').extract())
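If you also want to drop the surrounding whitespace and the "University:" label in the same step, a small post-processing pass over the extracted list works; a sketch, assuming the same page structure as in the output above:
raw = response.xpath('//*[@class="course-info__school__name"]//text()').extract()
cleaned = {t.strip() for t in raw if t.strip() and not t.strip().startswith('University:')}
# {'Massachusetts Institute of Technology'}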

In Scrapy, how to extract two groups in a regular expression into two different fields?

I'm writing a spider trulia to scrape pages of properties for sale on Trulia.com such as https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123; the current version can be found on https://github.com/khpeek/trulia-scraper.
I'm using Item Loaders and invoking the add_xpath method with the re keyword argument to specify regular expressions to extract. In the example in the documentation, there is just one group in the regular expression and one field to extract to.
However, I would actually like to define two groups and extract them to two separate Scrapy fields. Here is an 'excerpt' from the parse_property_page method:
def parse_property_page(self, response):
    l = TruliaItemLoader(item=TruliaItem(), response=response)
    details = l.nested_css('.homeDetailsHeading')
    overview = details.nested_xpath('.//span[contains(text(), "Overview")]/parent::div/following-sibling::div[1]')
    overview.add_xpath('overview', xpath='.//li/text()')
    overview.add_xpath('area', xpath='.//li/text()', re=r'([\d,]+) sqft$')
    overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (acres|sqft) lot size$')
Notice how the lot_size field has two groups extracted: one for the number, and one for the units which can be either 'acres' or 'sqft'. If I run this parse method using the command
scrapy parse https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123 --spider=trulia --callback=parse_property_page
then I get the following scraped item:
# Scraped Items ------------------------------------------------------------
[{'address': '1860 Lombard St',
'area': 2524.0,
'city_state': 'San Francisco, CA 94123',
'dates': ['10/22/2002', '04/25/2002', '03/20/2000'],
'description': ['Outstanding investment opportunity to own this light-fixer '
'mixed use Marina 2-unit property w/established income and '
'not on liquefaction. The first floor of this building '
'houses a commercial business currently leased to Jigalin '
'Fitness until 2018. The second floor presents a 2bed/1bath '
'apartment fully outfitted in a contemporary design w/full '
'kitchen, 10ft high ceilings & laundry area. The apartment '
'will be delivered vacant. The structure has undergone '
'renovation & features concrete perimeter foundation, '
'reinforced walls, ADA compliant commercial restroom, '
'electrical updates & rolling door. This property makes an '
"ideal investment with instant cash flow. Don't let this "
'pass you by. As-Is sale.'],
'events': ['Sold', 'Sold', 'Sold'],
'listing_information': ['2 Bedrooms', 'Multi-Family'],
'listing_information_date_updated': '11/03/2017',
'lot_size': ['1620', 'sqft'],
'neighborhood': 'Marina',
'overview': ['Multi-Family',
'2 Beds',
'Built in 1908',
'1 days on Trulia',
'1620 sqft lot size',
'2,524 sqft',
'$711/sqft'],
'prices': ['$850,000', '$1,350,000', '$1,200,000'],
'public_records': ['1 Bathroom',
'Multi-Family',
'1,296 Square Feet',
'Lot Size: 1,620 sqft'],
'public_records_date_updated': '07/01/2017',
'url': 'https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123'}]
where the lot_size field is a list with the number and the unit. However, I'd ideally like to extract the unit (acres or sqft) to a separate field lot_size_units. I could do this by first loading the item and doing my own processing, but I was wondering whether there is a more Scrapy-native way to 'unpack' the matched groups into different items?
(I've perused the get_value method in https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/loader/__init__.py, but it hasn't 'shown me the way' yet, if there is one.)
You could try this (ignoring one group at a time):
overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (?:acres|sqft) lot size$')
overview.add_xpath('lot_size_units', xpath='.//li/text()', re=r'(?:[\d,]+) (acres|sqft) lot size$')
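To see why the non-capturing (?:...) groups do the trick: the re= argument is effectively applied with re.findall semantics, and findall returns only the capturing groups when the pattern contains any. A minimal plain-re sketch with a made-up input string:
import re

text = '1620 sqft lot size'
print(re.findall(r'([\d,]+) (?:acres|sqft) lot size$', text))  # ['1620']
print(re.findall(r'(?:[\d,]+) (acres|sqft) lot size$', text))  # ['sqft']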

Conditionals on replacement in a string

So I may have a string 'Bank of China', or 'Embassy of China', and 'International China'
I want to replace all country instances except when we have an 'of ' or 'of the '
Clearly this can be done by iterating through a list of countries, checking if the name contains a country, then checking if before the country 'of ' or 'of the ' exists.
If these do exist then we do not remove the country, else we do remove the country. The examples will become:
'Bank of China', or 'Embassy of China', and 'International'
However, iteration can be slow, particularly when you have a large list of countries and a large list of texts for replacement.
Is there a faster and more conditionally based way of replacing the string? So that I can still use a simple pattern match using the Python re library?
My function is along these lines:
def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name = re.sub(country + '$', '', name).strip()
                return name
    return name
EDIT: I did find some info here. This does describe how to do an if, but I really want a
if not 'of '
if not 'of the '
then replace...
You could compile a few sets of regular expressions, then pass your list of input through them. Something like:
import re
countries = ['foo', 'bar', 'baz']
takes = [re.compile(r'of\s+(the)?\s*%s$' % (c), re.I) for c in countries]
subs = [re.compile(r'%s$' % (c), re.I) for c in countries]
def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s
print remove_country('the bank of foo')
print remove_country('the bank of the baz')
print remove_country('the nation bar')
''' Output:
the bank of foo
the bank of the baz
the nation
'''
It doesn't look like anything faster than linear time complexity is possible here. At least you can avoid recompiling the regular expressions a million times and improve the constant factor.
Edit: I had a few typos, but the basic idea is sound and it works. I've added an example.
I think you could use the approach in Python: how to determine if a list of words exist in a string to find any countries mentioned, then do further processing from there.
Something like
countries = [
"Afghanistan",
"Albania",
"Algeria",
"Andorra",
"Angola",
"Anguilla",
"Antigua",
"Arabia",
"Argentina",
"Armenia",
"Aruba",
"Australia",
"Austria",
"Azerbaijan",
"Bahamas",
"Bahrain",
"China",
"Russia"
# etc
]
def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string
get_countries = find_words_from_set_in_string(countries)
then
get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")
returns
set(['Argentina', 'China', 'Russia'])
... which obviously needs more post-processing, but very quickly tells you exactly what you need to look for.
As pointed out in the linked article, you must be wary of words ending in punctuation. Plain s.split() only splits on whitespace, so you would need something like re.split(r"[ \t\r\n,.!?;:'\"]+", s) instead. You may also want to look for adjectival forms, i.e. "Russian", "Chinese", etc.
Not tested:
def removeCountry(name):
    for country in countries:
        # Python's re only supports fixed-width lookbehinds, so use two
        # assertions instead of the variable-width (?<!of (the )?)
        name = re.sub(r'(?<!of )(?<!of the )' + country + '$', '', name).strip()
    return name
Using negative lookbehinds, re.sub only matches and replaces when the country is not preceded by of or of the. (Two separate assertions are needed because Python's re only accepts fixed-width lookbehind patterns.)
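A quick sanity check of the fixed-up version above, using a made-up one-country list:
>>> countries = ['China']
>>> removeCountry('Bank of China')
'Bank of China'
>>> removeCountry('International China')
'International'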
The re.sub function accepts a function as the replacement, which is called to get the text that should be substituted for each match. So you could do this:
import re
def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()

regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'
The result might contain some spurious spaces (in the above case a final strip() is needed). You can fix this by modifying the regex to:
\s*(of(\sthe)?\s)?(?P<state>({}))
This catches the spaces before of or before the country name and avoids the bad spacing in the output.
Note that this solution can handle a whole text, not just text of the form Something of Country and Something Country. For example:
In [38]: regex = make_regex(['China'])
...: text = '''This is more complex than just "Embassy of China" and "International China"'''
In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'
Another example usage:
In [33]: countries = [
...: 'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
...: 'France', 'Italy', 'Australia', 'New Zealand', 'Brazil',
...: 'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
...: 'Spain', 'Portugal', 'Argentina', 'San Marino'
...: ]
In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'
In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)
In [36]: regex = make_regex(countries)
...: result = regex.sub(remove_name, text)
In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'

Python: getting a certain no. of strings from a dictionary

I have a dictionary in the following format. I split the different elements (where a comma (,) occurred) using a split function and am now trying to extract the names from the list. I am trying to use regular expressions but am obviously failing miserably, being new to Python. The names are in the following formats...
firstname(space)last name
name(space)name(space)name
x.name
x.y.name
name(space) x.(space)(name)
where x and y represent a name initial, like J. for John, etc.
Also, if you can guide me in removing the "\t" while keeping the other information intact, that would be great.
Any sort of help would be more than welcome. Thank you all.
[[' I. Antonov', ' I. Antonova', ' E. R. Kandel', ' and R. D. Hawkins. Activity-dependent presynaptic facilitation and hebbian ltp are both required and interact during classical conditioning in aplysia. Neuron', ' 37(1):135--47', ' Jan 2003.'], ['\tSander M. Bohte ', ' Joost N. Kok', ' Applications of spiking neural networks', ' Information Processing Letters', ' v.95 n.6', ' p.519-520'], [' L. J. Eshelman. The CHC Adaptive Search Algorithm: How to Have Safe Search When Engaging in Nontraditional Genetic Recombination. Foundations Of Genetic Algorithms', ' pages 265-283', ' 1990.'], ['Wulfram Gerstner ', ' Werner Kistler', ' Spiking Neuron Models: An Introduction', ' Cambridge University Press', ''], [' D. O. Hebb. Organization of behavior. New York: Wiley', ' 1949.'], [' D. Z. Jin. Spiking neural network for recognizing spatiotemporal sequences of spikes. Physical Review E', '69', ' 2004.'], ['Wolfgang Maass ', ' Christopher M. Bishop', ' Pulsed Neural Networks', ' MIT Press', ' '], ['Wolfgang Maass ', ' Henry Markram', ' Synapses as dynamic memory buffers', ' Neural Networks', ' v.15 n.2', ' p.'], [' H. Markram', ' Y. Wang', ' and M. Tsodyks. Differential signaling via the same axon of neocortical pyramidal neurons. Neurobiology', ' 95:5323--5328', ' April 1998.'], ['\t\tD. E. Rumelhart ', ' G. E. Hinton ', ' R. J. Williams', ' Learning internal representations by error propagation', ' Parallel distributed processing: explorations in the microstructure of cognition', ' vol. 1: foundations', ' MIT Press', ' Cambridge', ' MA', ' 1986 </a> \t\t\t\t\t\t\t\t\t'], ['\t J. D. Schaffer', ' L. D. Whitley', ' and L. J. Eshelman. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In Combinations of Genetic Algorithms and NeuralNetworks', ' 1992.', ' COGANN-92. International Workshop on', ' pages 1--37', ' Philips Labs.', ' Briarcliff Manor', ' NY', ' 6 Jun 1992.'], ['\t S. Song', ' K. D. Miller', ' and L. F. Abbott. Competitive hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience', ' 3(9):919--926', ' 2000.'], ['\t L. Watts. Event-driven simulation of networks of spiking neurons. Advances in Neural Information Processing Systems', ' 6:927--934', ' 1994.']]
It looks like you're going to have to tailor this pretty heavily to your input. Because there are so many different words and constructs in the text you're parsing, you're probably not going to get 100% accuracy with the rules you create. Here's an example, though, assuming your original input text is called input_text (and I don't think using the split() method is really all that useful, because the commas don't just delimit names):
import re
regexes = (r'[A-Z][a-z]+ [A-Z][a-z]+',  # capitalized first and last name
           r'[A-Z]\. [A-Z][a-z]+')      # capitalized initial, then last name
names = []
for regex in regexes:
    names += re.findall(regex, input_text)
You'd obviously want to write additional specific regexes for your various name types. This does a good job of finding names, but also comes up with a lot of false positives (Information Processing looks a lot like a name based on these rules). This should give you a starting point though.
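For instance, an expanded tuple covering the other formats listed in the question might look like the following (untested beyond the sample data; note that the shorter patterns also match fragments of the longer names, e.g. 'D. Schaffer' inside 'J. D. Schaffer', so you may want to apply the most specific ones first and deduplicate):
regexes = (r'[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+',  # first name, initial, last name, e.g. "Sander M. Bohte"
           r'[A-Z]\. [A-Z]\. [A-Z][a-z]+',      # two initials, then last name, e.g. "J. D. Schaffer"
           r'[A-Z]\. [A-Z][a-z]+',              # single initial, then last name, e.g. "L. Watts"
           r'[A-Z][a-z]+ [A-Z][a-z]+')          # capitalized first and last name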
To remove the tab (and other empty spaces at beginning or end of the strings):
stripped = [[s.strip() for s in row] for row in mylist]  # mylist is a list of lists, so strip each inner string
To be honest, if you are trying to extract names, splitting lines like that will not help -- notice how some names are still grouped together with titles. Would be better to build a good regex that will match names, and use re.findall on individual lines.
To remove tabs and extra spaces, use strip():
>>> "\t foobar \t\t\t".strip()
'foobar'
It may also be that it's easier to find some online source of information where this job has already been done. For example, at places like this or this.
Strip all the strings.
Identify the strings that are surely not names (very long ones, ones that include numbers, and the ones after these in the list).
Identify the strings that are surely names (short strings at the beginning of the list, strings starting with the pattern ^[A-Z][a-z]{0,3}\.?\s, such as Dr., Miss, Mr, Prof, etc.).
Study the remaining strings that you can't match with these rules, and try to make fuzzy rules to choose, by creating a coefficient of certitude: the closer to the beginning of the list and the shorter the string, the higher its score compared to something long at the end. Add criteria like that and set a minimum score (a rough sketch follows below).
If you need high accuracy, look for a names database and Bayesian filters.
It won't be perfect: it's very hard to know the difference between 'name name name' and 'word word word'
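A rough sketch of the scoring idea above; the pattern, weights and threshold are made up here and would need tuning against real data:
import re

# Tokens are either capitalized words or single initials like "M."
NAME_PATTERN = re.compile(r'^([A-Z][a-z]+|[A-Z]\.)(\s+([A-Z][a-z]+|[A-Z]\.))+$')

def name_score(s, position, length):
    """Crude certitude score: higher means more likely to be a name."""
    s = s.strip()
    score = 0.0
    if NAME_PATTERN.match(s):
        score += 2.0                              # looks like capitalized words / initials
    if any(ch.isdigit() for ch in s):
        score -= 3.0                              # numbers suggest volumes, pages, years
    if len(s) <= 40:
        score += 1.0                              # short strings are more often names
    score += 1.0 - position / max(length, 1)      # earlier in the entry is more name-like
    return score

entry = ['\tSander M. Bohte ', ' Joost N. Kok', ' Applications of spiking neural networks',
         ' Information Processing Letters', ' v.95 n.6', ' p.519-520']
print([s.strip() for i, s in enumerate(entry) if name_score(s, i, len(entry)) >= 2.5])
# ['Sander M. Bohte', 'Joost N. Kok', 'Information Processing Letters']
# The false positive at the end is exactly the 'name name name' vs 'word word word' problem.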
