import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
#Example 2 is almost the same; however, some of the names are already encapsulated
# under the ((PERS)name) structure and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2
for i in result_list:
    input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
                        lambda m: (f"((PERS){m[1]})"),
                        input_text)
print(repr(input_text)) # --> output
Note that a name must meet certain conditions to be identified: it must sit between two whitespaces (\s*the searched name\s*), or be at the beginning ((?:(?<=\s)|^)) and/or at the end of the input string.
A name may also be followed by a comma, for example "Ada White, Melissa and Louis went shopping", or the space may have been accidentally omitted, as in "Ada White,Melissa and Louis went shopping".
For this reason, it is important that a name can still be found when it comes directly after [.,;].
Cases where the names should NOT be encapsulated would be, for example:
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
In these cases the name is followed or preceded by other characters that are not part of the name being searched for.
For examples 1 and 2 (note that example 2 is the same as example 1 but already has some encapsulated names and you have to prevent them from being encapsulated again), you should get the following output.
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
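You can use lookarounds to exclude already encapsulated names (a negative lookbehind for "(PERS)") and names followed by ', an alphanumeric character, - or a closing parenthesis (the latter keeps a shorter name such as Edd from matching inside an already encapsulated "((PERS)Thomas Edd)"):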
import re
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(result_list)})(?!['\w)-])")
input_text = re.sub(pat, r'((PERS)\1)', input_text)
Output:
((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.
Of course you can refine the content of your lookahead based on further edge cases.
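As one possible refinement, here is a minimal sketch that also rejects names glued to a preceding word character or hyphen and escapes each name before building the alternation. The shortened name list, the helper name tag_names and the sample sentence are assumptions made for illustration; they are not part of the original question.

```python
import re

# Shortened, illustrative name list; longest names first so that
# "Ada White" is tried before "Ada" or "White".
names = ['Ada White', 'Melissa Clark', 'Edd Thomas', 'Thomas Edd',
         'Thomas', 'Melissa', 'Ada', 'Edd', 'Clark', 'White']
names.sort(key=len, reverse=True)

# re.escape guards against names that contain regex metacharacters.
alternation = "|".join(re.escape(name) for name in names)

# (?<![\w-])      : not preceded by a word character or hyphen
# (?<!\(PERS\))   : not already encapsulated
# (?!['\w)-])     : not followed by ', a word character, ) or -
pat = re.compile(rf"(?<![\w-])(?<!\(PERS\))({alternation})(?!['\w)-])")

def tag_names(text):
    """Wrap every qualifying name in ((PERS)...)."""
    return pat.sub(r"((PERS)\1)", text)

sample = "Ada White,Melissa and McThomas met ((PERS)Edd Thomas) at Edd's shop."
tagged = tag_names(sample)
print(tagged)
```

Because already tagged names are skipped, running tag_names a second time on its own output should leave the text unchanged.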
from datasets import load_dataset  # Hugging Face
from transformers import BertTokenizer  # Hugging Face

def tokenized_dataset(dataset):
    """ Method that tokenizes each document in the train, test and validation dataset

    Args:
        dataset (DatasetDict): dataset that will be tokenized (train, test, validation)

    Returns:
        dict: dataset once tokenized
    """
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encode = lambda document: tokenizer(document, return_tensors='pt', padding=True, truncation=True)
    train_articles = [encode(document) for document in dataset["train"]["article"]]
    test_articles = [encode(document) for document in dataset["test"]["article"]]
    val_articles = [encode(document) for document in dataset["val"]["article"]]
    train_abstracts = [encode(document) for document in dataset["train"]["abstract"]]
    test_abstracts = [encode(document) for document in dataset["test"]["abstract"]]
    val_abstracts = [encode(document) for document in dataset["val"]["abstract"]]
    return {"train": (train_articles, train_abstracts),
            "test": (test_articles, test_abstracts),
            "val": (val_articles, val_abstracts)}

if __name__ == "__main__":
    dataset = load_data("./train/", "./test/", "./val/", "./.cache_dir")
    tokenized_data = tokenized_dataset(dataset)
I would like to modify the function tokenized_dataset because it creates a very heavy dictionary. The dataset created by that function will be reused for ML training, and I expect that dragging that dictionary around during training will slow it down a lot.
Be aware that the documents look similar to this:
[['eleven politicians from 7 parties made comments in letter to a newspaper .',
"said dpp alison saunders had ` damaged public confidence ' in justice .",
'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
'the cps has pursued at least 19 suspected paedophiles with dementia .'],
['an increasing number of surveys claim to reveal what makes us happiest .',
'but are these generic lists really of any use to us ?',
'janet street-porter makes her own list - of things making her unhappy !'],
["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
"` missoula : rape and the justice system in a college town ' was released april 21 .",
"three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
'players .',
"huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
'cause .',
'mr krakauer wrote book after realizing close friend was a rape victim .'],
['tesco announced a record annual loss of £ 6.38 billion yesterday .',
'drop in sales , one-off costs and pensions blamed for financial loss .',
'supermarket giant now under pressure to close 200 stores nationwide .',
'here , retail industry veterans , plus mail writers , identify what went wrong .'],
...,
['snp leader said alex salmond did not field questions over his family .',
"said she was not ` moaning ' but also attacked criticism of women 's looks .",
'she made the remarks in latest programme profiling the main party leaders .',
'ms sturgeon also revealed her tv habits and recent image makeover .',
'she said she relaxed by eating steak and chips on a saturday night .']]
So in the dictionary, the keys are just strings, but the values are all lists of lists of strings. Instead of storing lists of lists of strings as values, I think it would be better to store a list of objects, which should make the dictionary much lighter. How can I do that?
EDIT
For me, there is a difference between copying a list of lists of strings and copying a list of objects. Copying an object will simply copy the reference while copying a list of strings will copy everything. So it is much faster to copy the reference instead. This is the point of this question.
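As a quick check of the copying concern above, here is a minimal sketch on toy data. The function name train and the tiny dictionary are hypothetical stand-ins, not the real dataset. It suggests that Python hands lists and dictionaries around by reference too, so passing the dictionary to a training routine does not by itself duplicate the strings:

```python
def train(data):
    # Hypothetical stand-in for a training routine; it only returns its argument
    # so the caller can compare object identities.
    return data

# Toy structure shaped like the real one: key -> (articles, abstracts),
# each a list of lists of strings.
tokenized = {"train": ([["a", "b"], ["c"]], [["x"], ["y", "z"]])}

received = train(tokenized)

# `is` compares identity: True means the very same object, not a copy.
same_dict = received is tokenized
same_inner_list = received["train"][0][0] is tokenized["train"][0][0]
print(same_dict, same_inner_list)
```

A copy is only made when one is requested explicitly, for example with copy.deepcopy or by building a new list.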
A Python beginner is struggling here and needs some help :-(
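import urllib.request
from bs4 import BeautifulSoup

def getLinkinfo(endpoint):
    # Download the page and decode the response body to text
    response = urllib.request.urlopen(endpoint)
    link_info = response.read().decode()
    return link_info

text = getLinkinfo('http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html')
soup1 = BeautifulSoup(text, 'html.parser')
# 'class'=='article' evaluates to False, so filter with the class_ keyword instead
k = soup1.find_all('div', class_='article')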
Here I have already cut the main body I need to deal with, and one of the outputs is as shown below:
<div class="article"><h5>1. Let's resolve to reconnect, says Welby in new year message</h5>
<p class="metadata">Wed 1 Jan 2020 00:01 GMT</p>
<p class="metadata">Category: <span>UK-News</span></p>
<p class="snippet">The archbishop of Canterbury will urge people to make personal connections with others in 2020 to create a new unity in a divided society. In his new …</p></div>
My question is, how can I get:
(1) the 'Title' between the <h5> and <a> tags
(2) the 'Category', which follows <p class="metadata"> (there are two <p class="metadata"> elements; the one with the time is not needed)
(3) the 'Snippet', which follows <p class="snippet">
Thanks in advance for your help. I feel that if I know how to deal with this example, I can process a lot more.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from pprint import pprint as pp
def main(url):
    r = urlopen(url).read()
    soup = BeautifulSoup(r, 'lxml')
    goal = [(x.a.text, x.select("p")[1].text.split(' ', 1)[1], x.select_one('p.snippet').text)
            for x in soup.select('.article')]
    pp(goal)

main('http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html')
Output:
[("Let's resolve to reconnect, says Welby in new year message",
'UK-News',
'The archbishop of Canterbury will urge people to make personal connections '
'with others in 2020 to create a new unity in a divided society. In his new '
'…'),
("Be honest. You're not going to read all those books on your holiday, are "
'you?',
'Books',
'Every year, about this time, my Instagram feed fills up with pictures of '
'books. They’re piled somewhere between five and ten inches high, sometimes '
'st …'),
("Mariah Carey's Twitter account hacked on New Year's Eve",
'Music',
'Mariah Carey’s Twitter account appeared to have been hacked late Tuesday '
'afternoon, sharing numerous racist slurs and comments with the singer’s '
'21.4 …'),
('The joy audit: how to have more fun in 2020',
'Life-and-Style',
'The last time I felt joy was at an event that would be many people’s vision '
'of hell: a drunken Taylor Swift club-night singalong in the early hours of '
'…'),
('Providence Lost by Paul Lay review – the rise and fall of Oliver Cromwell’s '
'Protectorate',
'Books',
'The only public execution of a British head of state occurred 371 years ago '
'outside the Banqueting House in Whitehall on 30 January 1649. It was a rad '
'…'),
('Zero-carbon electricity outstrips fossil fuels in Britain across 2019',
'Business',
'Summary: Zero-carbon energy became Britain’s largest electricity source in '
'2019, delivering nearly half the country’s electrical power and for the '
'first time o …'),
('The final sprint: will any of the Democratic candidates excite voters?',
'US-News',
'Democrats overwhelmingly agree that their top priority in 2020 is to remove '
'Donald Trump from office. But which of the many Democrats running for pres '
'…'),
('War epics, airmen and young Sopranos: essential films for 2020',
'Film',
'1917 An epic of Lean-ian proportions is delivered in this spectacular from '
'director and co-writer Sam Mendes, who has developed a real-life story of h '
'…'),
("Stashing your cash: the beginner's guide to saving",
'Life-and-Style',
'Much like going for a run or eating your greens, saving your cash offers '
'long-term benefits, but is not always appealing. And, let’s face it, there '
'ar …'),
("'I'm on the hunt for humour and hope': what will authors be reading in "
'2020?',
'Books',
'Matt Haig I have been very dark and gloomy with my reading habits this '
'year, perhaps in tune with the social mood. Like a pig sniffing for '
'truffles, I …'),
('Twenty athletes set to light up the Tokyo 2020 Olympics',
'Sport',
'Dina Asher-Smith Great Britain Athletics, 100m, 200m, 4x100m Seb Coe, who '
'knows a thing or two about winning Olympic titles, believes Asher-Smith '
'will …'),
('The most exciting movies of 2020 – horror',
'Film',
'The Grudge A belated English language reboot of Japanese classic Ju-On: The '
'Grudge (2002), this stars Andrea Riseborough and Demián Bichir as detectiv '
'…'),
('Diary of a Murderer by Kim Young-ha review – dark stories from South Korea',
'Books',
'Given that loss of memory has become a familiar device in fiction, and the '
'psychopath such a popular character archetype, we shouldn’t be too surprise '
'…'),
('US election, Brexit and China to sway the markets in 2020',
'Business',
'After profiting from strong markets in 2019, investors are expecting 2020 '
'to bring further rising asset prices and lively merger activity. But '
'Brexit, …'),
('TS Eliot’s intimate letters to confidante unveiled after 60 years',
'Books',
'A collection of more than 1,000 letters from the Nobel laureate TS Eliot to '
'his confidante and muse Emily Hale is unveiled this week, after having bee '
'…'),
('Clive Lewis calls for unity among Labour leadership hopefuls',
'Politics',
'Summary: The Labour leadership hopeful Clive Lewis has called for unity '
'among would-be candidates to succeed Jeremy Corbyn as they confront the '
'“cliff face” of …'),
("Visa applications: Home Office refuses to reveal 'high risk' countries",
'UK-News',
'Summary: Campaign groups have criticised the Home Office after it refused '
'to release details of which countries are deemed a “risk” in an algorithm '
'that filter …'),
('Victims of NYE Surrey road crash were BA cabin crew',
'UK-News',
'At least seven people have been killed across the UK in road traffic '
'collisions over the new year period. The deaths included three British '
'Airways ca …'),
('Man held on suspicion of double murder after bodies found in house',
'UK-News',
'Police have arrested a man on suspicion of murdering two people at a house '
'in the village of Duffield in Derbyshire. The murder investigation was laun '
'…'),
("Great expectations: 'The quest for perfection has cannibalised my identity'",
'Life-and-Style',
'“You need to practice self-compassion,” my psychologist says to me. This is '
'our sixth session and as per usual he is struggling to find a phrase, a po '
'…'),
('Anti-Islamic slogans spray-painted near mosque in Brixton',
'UK-News',
'Anti-Islamic slogans have been painted on a building close to a mosque and '
'cultural centre in south London, the Metropolitan police have said. Officer '
'…'),
('Michael van Gerwen 3-7 Peter Wright: PDC world darts championship final – '
'as it happened',
'Sport',
'Summary: That’s it for tonight’s blog, so I’ll leave you with a report from '
'Ally Pally. Thanks for your company, goodnight! There’s so much affection '
'for Peter …'),
('Greggs launches meatless steak bake to beef up its vegan range',
'Business',
'Greggs, the UK’s largest bakery chain, will end speculation about its hotly '
'anticipated new vegan snack by launching a meat-free version of its popula '
'…'),
('Woodford folk festival review – a much-needed moment of positivity and '
'reprieve',
'Music',
'If Woodford folk festival was in mourning this year, you wouldn’t have '
'known it. The death in May of festival elder and decade-long patron Bob '
'Hawke c …'),
('Household haze: how to reduce smoke in your home without an air purifier',
'Life-and-Style',
'Summary: On 1 January, Canberra experienced its worst air quality on '
'record. Smoke from Australia’s devastating bushfires has now blown as far '
'as Queenstown, N …'),
("Sadiq Khan pledges free London travel for disabled people's carers",
'Politics',
'Sadiq Khan has kickstarted his bid for a second term as London mayor by '
'pledging free travel on the city’s transport for anyone accompanying a '
'disable …'),
('‘Everyone thought I was mad’: how to make a life-changing decision – and '
'stick to it',
'Life-and-Style',
'Summary: When I was 26, I broke up with a long-term partner, got an '
'ill-advised facial piercing and changed careers – all in the space of a '
'month. What I learn …'),
('In the Line of Duty review – race-against-time cop thriller',
'Film',
'There’s a straight-to-video feel to this cop thriller, directed by action '
'veteran Steven C Miller, written by Jeremy Drysdale (who scripted the indie '
'…'),
("Manchester poet Tony Walsh performs tribute to children's hospital",
'UK-News',
'The performance poet Tony Walsh, whose ode to Manchester became a worldwide '
'hit after the Arena bomb, has written a moving tribute to Royal Manchester '
'…'),
('Can your phone keep you fit? Our writers try 10 big fitness apps – from '
'weightlifting to pilates',
'Life-and-Style',
'Centr Price £15.49 a month. What is it? A full-service experience from the '
'Hollywood star Chris Hemsworth: not just workouts, but a complete meal plan '
'…'),
('We Are from Jazz review – zany Russian musical comedy',
'Film',
'Only in a Woody Allen film will you hear quite as much Dixieland jazz as '
'this. Here is We Are from Jazz, or We Are Jazzmen, the zany jazz comedy '
'music …'),
('The Other Half of Augusta Hope by Joanna Glen review – high emotions',
'Books',
'Summary: Who is Augusta Hope’s “other half”? In Glen’s debut, shortlisted '
'for the Costa first novel prize, at first it’s Augusta’s twin sister, '
'although the di …'),
('Tara Erraught/James Baillieu review – quietly intense and simply exquisite',
'Music',
'Irish mezzo Tara Erraught’s latest Wigmore recital with her pianist James '
'Baillieu took place between Christmas and New Year, though her beautifully '
'c …'),
('Talking Horses: picking the five best races of the last decade',
'Sport',
'You might take the view that the end of the decade is actually a year away, '
'but at least it’s 10 years since I last did something like this. I’ve limi '
'…'),
('The Reality Bubble by Ziya Tong review – blind spots and hidden truths',
'Books',
'Publishing functions very much like the fashion world. Like a suddenly '
'ubiquitous cut of hem or style of trainer, a book comes along every few '
'seasons …'),
('Alleged drink-driver arrested on motorway had no front tyres',
'UK-News',
'An alleged drink-driver who was arrested on the motorway on New Year’s Day '
'had been driving without front tyres. The motorist was said to be nearly si '
'…'),
('MC Beaton, multimillion-selling author of Agatha Raisin novels, dies aged '
'83',
'Books',
'MC Beaton, the prolific creator of the much loved fictional detectives '
'Agatha Raisin and Hamish Macbeth, has died after a short illness at the age '
'of …'),
("First transgender Marvel superhero coming 'very soon'",
'Film',
'The first transgender character in a Marvel movie will probably appear in a '
'film released next year. Speaking at an event at the New York Film Academy '
'…'),
('The six-pack can wait: how to set fitness goals you will actually keep',
'Life-and-Style',
'Summary: Most of us have, at some point in our lives, looked in the mirror '
'and decided we need a radical image overhaul – especially in January. Then, '
'when we …'),
('Gold from Highlands mine to be made into Scottish jewellery',
'UK-News',
'A small goldmine in the Highlands plans to start producing gold in '
'commercial quantities for the first time after repeated delays. The mine at '
'Cononis …'),
('Tell us about your mixed-sex civil partnership plans',
'UK-News',
'The first mixed-sex couples have started to become civil partners in the '
'UK, following a landmark legal battle won by Rebecca Steinfeld and Charles '
'Ke …'),
('England ready for tortoise and hare race in second Test at Newlands',
'Sport',
'As Harold Macmillan is supposed to have explained, there are times when the '
'best‑laid plans disappear like melting snow in springtime and a whole new '
'…'),
("All Federico Fellini's films – ranked!",
'Film',
'20. The Voice of the Moon (1990) A gentle, episodic Fellini, with Roberto '
'Benigni playing Ivo, a madcap character who travels far and wide across the '
'…'),
('Isle of Wight’s rattling, rolling, charming ex-tube trains face end of the '
'line',
'UK-News',
'The train trip from Ryde Pier Head to Shanklin on the Isle of Wight in '
'carriages built 80 years ago for the small tunnels of certain London '
'Undergroun …'),
('Meaty by Samantha Irby review – scatological essays',
'Books',
'To call Samantha Irby’s book scatological would be an understatement. This '
'is a book about assholes – yes, the kind who cheats on you, or never calls, '
'…'),
("Call for more diverse Lake District sparks row over area's future",
'UK-News',
'The head of the Lake District national park authority (LDNPA) has been '
'accused of using the issue of diversity to push through commercial '
'development …'),
("Sharon Choi: how we fell for Bong Joon-ho's translator",
'Film',
'Just when you thought Bong Joon-ho – the affable maestro of Korean cinema '
'and now, with his class-conscious Cannes winner Parasite, champion of the '
'pe …'),
('Whitehall reforms may lead to discrimination, says union',
'Politics',
'Boris Johnson’s “seismic” Whitehall reforms, including regular exams for '
'senior civil servants, could lead to discrimination against staff on the '
'grou …'),
('The most exciting movies of 2020 – biopics',
'Film',
'Respect Having wiped away her Catstoddler snot, Jennifer Hudson gives her '
'pipes a wider airing in this Aretha Franklin biopic which – unlike other '
'mov …'),
('Tune-free pop and the new Katie Hopkins: our 2020 celebrity predictions',
'Life-and-Style',
'There are two ways to spend New Year’s Eve, as best as I can tell: you '
'either dirty the floor of a house party and spend the smallest of the small '
'hou …')]
Great question. It covers the common issues many people run into when extracting selected text.
We can use BeautifulSoup's own find and find_all methods to get the selected text, as below:
import requests
from bs4 import BeautifulSoup
url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# 1. To get PARENT TEXT only and ignore CHILD TEXT
title = soup.find('h5')
titletextonly = title.find_all(text=True, recursive=False) # search inside the <h5>, not the whole soup; gives only parent text
#fulltitletext = title.find_all(text=True, recursive=True) # will give all text under it including child text
titletext = "".join(titletextonly)
##################################################
# 2. Category
p_elements = soup.find_all('p', class_='metadata')
for p_element in p_elements:
    p_with_span_element = p_element.find('span')
    if p_with_span_element is not None: # the date paragraph has no <span>
        span_text = p_with_span_element.text # UK-News
##################################################
# 3. Snippet
p_snippet_text = soup.find('p', class_='snippet').text
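To collect all three fields per article rather than only the first match on the page, the same calls can be applied inside each div.article. The following is a minimal sketch that assumes the bs4 package is installed and uses the single article shown in the question as inline HTML (with the snippet text shortened), so it runs without network access; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample article from the question (snippet shortened).
html = """<div class="article"><h5>1. Let's resolve to reconnect, says Welby in new year message</h5>
<p class="metadata">Wed 1 Jan 2020 00:01 GMT</p>
<p class="metadata">Category: <span>UK-News</span></p>
<p class="snippet">The archbishop of Canterbury will urge people to make personal connections.</p></div>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for article in soup.find_all("div", class_="article"):
    # Full text of the heading, including any child tags.
    title = article.find("h5").get_text(strip=True)
    # Only the category paragraph contains a <span>; the date paragraph does not.
    span = article.select_one("p.metadata span")
    category = span.get_text(strip=True) if span is not None else None
    snippet = article.find("p", class_="snippet").get_text(strip=True)
    rows.append((title, category, snippet))

print(rows)
```

Searching from each article element instead of from soup keeps the title, category and snippet of one article together.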
I'm writing a spider, trulia, to scrape pages of properties for sale on Trulia.com such as https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123; the current version can be found at https://github.com/khpeek/trulia-scraper.
I'm using Item Loaders and invoking the add_xpath method with the re keyword argument to specify regular expressions to extract. In the example in the documentation, there is just one group in the regular expression and one field to extract to.
However, I would actually like to define two groups and extract them to two separate Scrapy fields. Here is an 'excerpt' from the parse_property_page method:
def parse_property_page(self, response):
    l = TruliaItemLoader(item=TruliaItem(), response=response)
    details = l.nested_css('.homeDetailsHeading')
    overview = details.nested_xpath('.//span[contains(text(), "Overview")]/parent::div/following-sibling::div[1]')
    overview.add_xpath('overview', xpath='.//li/text()')
    overview.add_xpath('area', xpath='.//li/text()', re=r'([\d,]+) sqft$')
    overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (acres|sqft) lot size$')
Notice how the lot_size field has two groups extracted: one for the number, and one for the units which can be either 'acres' or 'sqft'. If I run this parse method using the command
scrapy parse https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123 --spider=trulia --callback=parse_property_page
then I get the following scraped item:
# Scraped Items ------------------------------------------------------------
[{'address': '1860 Lombard St',
'area': 2524.0,
'city_state': 'San Francisco, CA 94123',
'dates': ['10/22/2002', '04/25/2002', '03/20/2000'],
'description': ['Outstanding investment opportunity to own this light-fixer '
'mixed use Marina 2-unit property w/established income and '
'not on liquefaction. The first floor of this building '
'houses a commercial business currently leased to Jigalin '
'Fitness until 2018. The second floor presents a 2bed/1bath '
'apartment fully outfitted in a contemporary design w/full '
'kitchen, 10ft high ceilings & laundry area. The apartment '
'will be delivered vacant. The structure has undergone '
'renovation & features concrete perimeter foundation, '
'reinforced walls, ADA compliant commercial restroom, '
'electrical updates & rolling door. This property makes an '
"ideal investment with instant cash flow. Don't let this "
'pass you by. As-Is sale.'],
'events': ['Sold', 'Sold', 'Sold'],
'listing_information': ['2 Bedrooms', 'Multi-Family'],
'listing_information_date_updated': '11/03/2017',
'lot_size': ['1620', 'sqft'],
'neighborhood': 'Marina',
'overview': ['Multi-Family',
'2 Beds',
'Built in 1908',
'1 days on Trulia',
'1620 sqft lot size',
'2,524 sqft',
'$711/sqft'],
'prices': ['$850,000', '$1,350,000', '$1,200,000'],
'public_records': ['1 Bathroom',
'Multi-Family',
'1,296 Square Feet',
'Lot Size: 1,620 sqft'],
'public_records_date_updated': '07/01/2017',
'url': 'https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123'}]
where the lot_size field is a list with the number and the unit. However, I'd ideally like to extract the unit (acres or sqft) to a separate field lot_size_units. I could do this by first loading the item and doing my own processing, but I was wondering whether there is a more Scrapy-native way to 'unpack' the matched groups into different items?
(I've perused the get_value method on https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/loader/__init__.py, but this hasn't 'shown me the way' yet, if there is one.)
You could try this (ignoring one group at a time):
overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (?:acres|sqft) lot size$')
overview.add_xpath('lot_size_units', xpath='.//li/text()', re=r'(?:[\d,]+) (acres|sqft) lot size$')
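To see what each of the two patterns captures, here is a minimal sketch using only Python's re module on the overview strings from the scraped item above. It does not reproduce Scrapy's own extraction; the helper first_group is a hypothetical name introduced for illustration.

```python
import re

# Overview strings copied from the scraped item in the question.
overview = ['Multi-Family', '2 Beds', 'Built in 1908', '1 days on Trulia',
            '1620 sqft lot size', '2,524 sqft', '$711/sqft']

# Each pattern keeps one capturing group and makes the other non-capturing.
size_re = re.compile(r'([\d,]+) (?:acres|sqft) lot size$')
units_re = re.compile(r'(?:[\d,]+) (acres|sqft) lot size$')

def first_group(pattern, values):
    """Return group 1 of the first value that matches, or None."""
    for value in values:
        match = pattern.search(value)
        if match:
            return match.group(1)
    return None

lot_size = first_group(size_re, overview)
lot_size_units = first_group(units_re, overview)
print(lot_size, lot_size_units)
```

Because each pattern has exactly one capturing group, each field receives a single value instead of a list holding both the number and the unit.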