Avoid duplicate values in Scrapy - python

I'm scraping MOOC data from Course Talk pages, and I'm having trouble cleaning some of the fields, e.g. the university name.
From the linked page I want to get: Massachusetts Institute of Technology
This is the xpath I'm using for that field:
response.xpath('//*[@class="course-info__school__name"]//text()').extract()
The problem here is that I'm getting duplicated values and empty strings from it:
[u'\n ',
u'University:\xa0',
u'\n Massachusetts Institute of Technology\n ',
u'\n ',
u'University:\xa0',
u'\n Massachusetts Institute of Technology\n ']

You can skip the inner span by using the not() function (to exclude the inner child span node) and the normalize-space() function to skip whitespace-only text strings and clean the text:
//*[@class="course-info__school__name"]/text()[not(self::span)][normalize-space()]
As a result, you should get two equal strings with the university name only:
[u'Massachusetts Institute of Technology',
u'Massachusetts Institute of Technology']
And you can use a Python set to get unique names only:
>>> l = [u'Massachusetts Institute of Technology',
... u'Massachusetts Institute of Technology']
>>> set(l)
set([u'Massachusetts Institute of Technology'])
If you need the contents of the first div only, you can select it by index 1 in pure XPath:
(//*[@class="course-info__school__name"])[1]/text()[not(self::span)][normalize-space()]
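Putting the two together, a minimal sketch of the whole extraction (my addition, assuming response is the Scrapy response for the course page):
name = response.xpath(
    '(//*[@class="course-info__school__name"])[1]'
    '/text()[not(self::span)][normalize-space()]'
).extract_first()
# strip the surrounding whitespace left in the text node
university = name.strip() if name else None
# university -> u'Massachusetts Institute of Technology'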

The reason lies in the fact that there are two divs with the class name course-info__school__name.
Therefore, to avoid duplicates, you could change the XPath so that it only selects the first div element with the class name course-info__school__name:
response.xpath('(//div[@class="course-info__school__name"])[1]//text()').extract()
which will give you:
['\n ',
'University:\xa0',
'\n Massachusetts Institute of Technology\n ']
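Those pieces still carry stray whitespace and the "University:" label; a small cleanup sketch (my addition, not part of the original answer):
parts = response.xpath('(//div[@class="course-info__school__name"])[1]//text()').extract()
# drop whitespace-only strings and strip the rest
cleaned = [p.strip() for p in parts if p.strip()]
# cleaned -> [u'University:', u'Massachusetts Institute of Technology']
university = cleaned[-1]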
Hope it helps!

You can always get unique values this way:
set(response.xpath('//*[@class="course-info__school__name"]//text()').extract())
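Note that set() does not preserve document order; if order matters, a sketch of order-preserving deduplication (my addition):
texts = response.xpath('//*[@class="course-info__school__name"]//text()').extract()
seen = set()
unique = []
for t in texts:
    if t not in seen:  # keep only the first occurrence of each string
        seen.add(t)
        unique.append(t)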

Related

Iterate through specific tags in python

I want to extract text from the website and the format is like this:
Avalon
Avondale
Bacon Park Area
How do I select just those 'a' tags with href="#N..." (there are several other links on the page)?
I tried creating a list to iterate through, but when I run the code it only selects one element.
loc = ['#N0', '#N1', '#N2', '#N3', '#N4', '#N5', ..., '#N100']
for i in loc:
    name = soup.find('a', attrs={'href': i})
    print(name)
I get
<a href="#N...">Avalon</a>
not
<a href="#N...">Avalon</a>
<a href="#N...">Avondale</a>
<a href="#N4">Bacon Park Area</a>
when what I want is just:
Avalon
Avondale
Bacon Park Area
Thanks in advance!
You're iterating over the items but not putting them anywhere, so when the loop is done, all that's left in name is the last item.
You can put them in a list like below, and access the .text attribute to get just the name from the tag:
names = []
for i in loc:
    names.append(soup.find('a', attrs={'href': i}).text)
Result:
In [15]: names
Out[15]: ['Bacon Park Area', 'Avondale', 'Avalon']
If you want to leave out the first list's creation you can just do:
import re
names = [tag.text for tag in soup.find_all('a',href=re.compile(r'#N\d+'))]
In a regular expression, \d matches a digit and + means one or more instances of the preceding token.
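For reference, a self-contained sketch with made-up markup (the href values here are illustrative, not from the original page):
import re
from bs4 import BeautifulSoup

html = '''
<a href="#N0">Avalon</a>
<a href="#N1">Avondale</a>
<a href="#N4">Bacon Park Area</a>
<a href="/elsewhere">Skip me</a>
'''
soup = BeautifulSoup(html, 'html.parser')
# find_all matches the href attribute against the compiled pattern
names = [tag.text for tag in soup.find_all('a', href=re.compile(r'#N\d+'))]
print(names)  # ['Avalon', 'Avondale', 'Bacon Park Area']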

Scrapy formatting results

I'm just starting to get to grips with Scrapy. So far, I've figured out how to extract the relevant sections of a web page and to crawl through web pages.
However, I'm still unsure how to format the results in a meaningful tabular format.
When the scraped data is in table format, it's straightforward enough. However, sometimes it isn't, e.g. this link
I can access the names using
response.xpath('//div[@align="center"]//h3').extract()
Then I can access the details using
response.xpath('//div[@align="center"]//p').extract()
Now, I need to format the data like this, so I can save it to a CSV file.
Name: J Speirs Farms Ltd
Herd Prefix: Pepperstock
Membership No. 7580
Dept. Herd Mark: UK244821
Membership Type: Youth
Year Joined: 2006
Address: Pepsal End Farm, Pepperstock, Luton, Beds
Postcode: LU1 4LH
Region: East Midlands
Telephone: 01582450962
Email:
Website:
Ideally, I'd like to define the structure of the data and then populate it according to the scraped data, because in some cases certain fields are not available, e.g. Email: and Website:
I don't need the answer, but would appreciate if someone can point me in the right direction.
All of the data seem to be separated by newlines, so simply use str.splitlines():
> names = response.xpath('//div[@align="center"]//a[@name]')
> details = names[0].xpath('following-sibling::p[1]/text()').extract_first().splitlines()
['J Speirs Farms Ltd ', 'Herd Prefix: Pepperstock ', 'Membership No. 7580 ', 'Dept. Herd Mark: UK244821 ', 'Membership Type: Youth ', 'Year Joined: 2006 ', 'Address: Pepsal End Farm ', ' Pepperstock ', ' Luton ', ' Beds ', 'Postcode: LU1 4LH ', 'Region: East Midlands ', 'Telephone: 01582450962 ']
> name = names[0].xpath('@name').extract_first()
'J+Speirs+Farms+Ltd+++'
Now you just need to figure out how to parse those bits into a clean format.
Some values are split across multiple lines, but you can identify and fix the list by checking whether a member contains : or No.; if it doesn't, it belongs to the preceding member that does:
clean_details = [f'Name: {details[0]}']
for d in details[1:]:  # the first item is the name, already handled above
    if ':' in d or 'No.' in d:
        clean_details.append(d)
    else:  # no label, so it continues the previous field (e.g. the address)
        clean_details[-1] += d
Finally, parse the cleaned-up details list:
item = {}
for detail in clean_details:
    values = detail.split(':')
    if len(values) < 2:  # no colon, e.g. 'Membership No. 7580'
        values = detail.split('No.')
    if len(values) == 2:  # e.g. 'Telephone: 01582450962'
        label, text = values
        item[label] = text.strip()
>>> pprint(item)
{'Address': 'Pepsal End Farm Pepperstock Luton Beds',
'Dept. Herd Mark': 'UK244821',
'Herd Prefix': 'Pepperstock',
'Membership ': '7580',
'Membership Type': 'Youth',
'Name': 'J Speirs Farms Ltd',
'Postcode': 'LU1 4LH',
'Region': 'East Midlands',
'Telephone': '01582450962',
'Year Joined': '2006'}
You can define a class for the items you want to save and import it into your spider; then you can save the items directly.
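For example, a sketch of such an item class (the field names are my guesses based on the labels above):
import scrapy

class MemberItem(scrapy.Item):
    name = scrapy.Field()
    herd_prefix = scrapy.Field()
    membership_no = scrapy.Field()
    membership_type = scrapy.Field()
    address = scrapy.Field()
    postcode = scrapy.Field()
    region = scrapy.Field()
    telephone = scrapy.Field()

In the spider you would then yield MemberItem(name=item['Name'], postcode=item['Postcode'], ...) and let Scrapy's CSV feed export write one column per field.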

In Scrapy, how to extract two groups in a regular expression into two different fields?

I'm writing a spider trulia to scrape pages of properties for sale on Trulia.com such as https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123; the current version can be found on https://github.com/khpeek/trulia-scraper.
I'm using Item Loaders and invoking the add_xpath method with the re keyword argument to specify regular expressions to extract. In the example in the documentation, there is just one group in the regular expression and one field to extract to.
However, I would actually like to define two groups and extract them to two separate Scrapy fields. Here is an 'excerpt' from the parse_property_page method:
def parse_property_page(self, response):
    l = TruliaItemLoader(item=TruliaItem(), response=response)
    details = l.nested_css('.homeDetailsHeading')
    overview = details.nested_xpath('.//span[contains(text(), "Overview")]/parent::div/following-sibling::div[1]')
    overview.add_xpath('overview', xpath='.//li/text()')
    overview.add_xpath('area', xpath='.//li/text()', re=r'([\d,]+) sqft$')
    overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (acres|sqft) lot size$')
Notice how the lot_size field has two groups extracted: one for the number, and one for the units which can be either 'acres' or 'sqft'. If I run this parse method using the command
scrapy parse https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123 --spider=trulia --callback=parse_property_page
then I get the following scraped item:
# Scraped Items ------------------------------------------------------------
[{'address': '1860 Lombard St',
'area': 2524.0,
'city_state': 'San Francisco, CA 94123',
'dates': ['10/22/2002', '04/25/2002', '03/20/2000'],
'description': ['Outstanding investment opportunity to own this light-fixer '
'mixed use Marina 2-unit property w/established income and '
'not on liquefaction. The first floor of this building '
'houses a commercial business currently leased to Jigalin '
'Fitness until 2018. The second floor presents a 2bed/1bath '
'apartment fully outfitted in a contemporary design w/full '
'kitchen, 10ft high ceilings & laundry area. The apartment '
'will be delivered vacant. The structure has undergone '
'renovation & features concrete perimeter foundation, '
'reinforced walls, ADA compliant commercial restroom, '
'electrical updates & rolling door. This property makes an '
"ideal investment with instant cash flow. Don't let this "
'pass you by. As-Is sale.'],
'events': ['Sold', 'Sold', 'Sold'],
'listing_information': ['2 Bedrooms', 'Multi-Family'],
'listing_information_date_updated': '11/03/2017',
'lot_size': ['1620', 'sqft'],
'neighborhood': 'Marina',
'overview': ['Multi-Family',
'2 Beds',
'Built in 1908',
'1 days on Trulia',
'1620 sqft lot size',
'2,524 sqft',
'$711/sqft'],
'prices': ['$850,000', '$1,350,000', '$1,200,000'],
'public_records': ['1 Bathroom',
'Multi-Family',
'1,296 Square Feet',
'Lot Size: 1,620 sqft'],
'public_records_date_updated': '07/01/2017',
'url': 'https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123'}]
where the lot_size field is a list with the number and the unit. However, I'd ideally like to extract the unit (acres or sqft) to a separate field lot_size_units. I could do this by first loading the item and doing my own processing, but I was wondering whether there is a more Scrapy-native way to 'unpack' the matched groups into two different fields?
(I've perused the get_value method in https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/loader/__init__.py, but it hasn't 'shown me the way' yet, if there is one.)
You could try this (ignoring one group at a time):
overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (?:acres|sqft) lot size$')
overview.add_xpath('lot_size_units', xpath='.//li/text()', re=r'(?:[\d,]+) (acres|sqft) lot size$')
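The (?:...) syntax makes a group non-capturing, so each expression still has to match the whole phrase but extracts only one value. A quick demonstration (my addition):
import re

text = '1620 sqft lot size'
re.search(r'([\d,]+) (?:acres|sqft) lot size$', text).group(1)  # '1620'
re.search(r'(?:[\d,]+) (acres|sqft) lot size$', text).group(1)  # 'sqft'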

Grab a String between strings using python

I have the following string,
s = {$deletedFields:name:[standardizedSkillUrn,standardizedSkill],entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,25),name:Political Campaigns,$type:com.linkedin.voyager.identity.profile.Skill},{$deletedFields:[standardizedSkillUrn,standardizedSkill],entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,28),name:Politics,$type:com.linkedin.voyager.identity.profile.Skill},name:
{$deletedFields:[standardizedSkillUrn,standardizedSkill],entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,27),name:Political Consulting,$type:com.linkedin.voyager.identity.profile.Skill},
{$deletedFields:[standardizedSkillUrn,standardizedSkill],entityUrn:urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,26),name:Grassroots Organizing,$type:com.linkedin.voyager.identity.profile.Skill},
{$deletedFields:[],profileId:ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,elements:[urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,25),urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,26),urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,27),urn:li:fs_skill:(ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,28)],paging:urn:li:fs_profileView:ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,skillView,paging,$type:com.linkedin.voyager.identity.profile.SkillView,$id:urn:li:fs_profileView:ACoAAA0C3rkBDZ7qyoWoEmj9CxUv3QW6brC836w,skillView},
{$deletedFields:[]
I want to grab
name:Political Campaigns
name:Politics
name:Political Consulting
name:Grassroots Organizing
name = [Political Campaigns, Politics, Political Consulting, Grassroots Organizing]
The above string is from a file I want to scrape.
Keep in mind that name has many instances in the file.
Is there a way to find fs_skill, skip some garbage value, then look for the nearby name: and grab the string that follows it?
data = [pair[5:] for pair in s.split(',') if pair[:4] == 'name' and pair[5].isalpha()]
Output:
['Political Campaigns', 'Politics', 'Political Consulting', 'Grassroots Organizing']
Can you try the above code snippet? Hope this helps.
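An alternative sketch using a regular expression (my addition, not from the original answer), keeping only name: values that start with a letter so the name:[...] lists are skipped:
import re

# capture everything after 'name:' up to the next comma or brace
names = re.findall(r'name:([A-Za-z][^,}]*)', s)
# ['Political Campaigns', 'Politics', 'Political Consulting', 'Grassroots Organizing']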

Cut a field and LIKE search the parts

I have a list of names (actually, authors) stored in a sqlite database. Here is an example:
João Neres, Ruben C. Hartkoorn, Laurent R. Chiarelli, Ramakrishna Gadupudi, Maria Rosalia Pasca, Giorgia Mori, Alberto Venturelli, Svetlana Savina, Vadim Makarov, Gaelle S. Kolly, Elisabetta Molteni, Claudia Binda, Neeraj Dhar, Stefania Ferrari, Priscille Brodin, Vincent Delorme, Valérie Landry, Ana Luisa de Jesus Lopes Ribeiro, Davide Farina, Puneet Saxena, Florence Pojer, Antonio Carta, Rosaria Luciani, Alessio Porta, Giuseppe Zanoni, Edda De Rossi, Maria Paola Costi, Giovanna Riccardi, Stewart T. Cole
It's a single string. My goal is to write an efficient name "analyser", so I basically perform a LIKE query:
' ' || replace(authors, ',', ' ') || ' ' LIKE '{0}'.format(my_string)
I basically replace all the commas with a space, and insert a space at the end and at the beginning of the string. So if I look for:
% Rossi %
I'll get all the items where one of the authors has "Rossi" as a family name: "Rossi", not "Rossignol" or "Trossi".
It's an efficient way to look for an author with his family name, because I'm sure the string stored in the database contains the family names of the authors, unaltered.
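In Python, that query might be built like this (a sketch; the papers table and authors column names are assumptions, and binding the pattern with ? avoids the quoting problems that str.format can cause):
import sqlite3

conn = sqlite3.connect('papers.db')  # hypothetical database file
pattern = '% Rossi %'
rows = conn.execute(
    "SELECT * FROM papers "
    "WHERE ' ' || replace(authors, ',', ' ') || ' ' LIKE ?",
    (pattern,),
).fetchall()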
But the main problem lies here: "Rossi" is, for example, a very common family name. So if I want to look for a very particular person, I will add his first name. Let's assume it is "Jean-Philippe". "Jean-Philippe" can be stored in the database under many forms: "J.P Rossi", "Jean-Philippe Rossi", "J. Rossi", "Jean P. Rossi", etc.
So I tried this:
% J%P Rossi %
But of course, it matches everything containing a J, then a P, and finally Rossi, so it matches the string I gave as an example (Edda De Rossi).
So I wonder if there is a way to cut the string in the query, on a delimiter, and then match each piece against the search pattern.
Of course I'm open to any other solution. My goal is to match the search pattern against each author name.
