I have a list of the following authors for Google Scholar papers: Zoe Pikramenou, James H. R. Tucker, Alison Rodger, Timothy Dafforn. I want to extract and print titles for the papers present for at least 3 of these.
You can get a dictionary of paper info from each author using Scholarly:
from scholarly import scholarly

AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']

for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    print(author)
The output looks something like this (just a small excerpt from what you'd get for one author):
{'bib': {'cites': '69',
'title': 'Chalearn looking at people and faces of the world: Face '
'analysis workshop and challenge 2016',
'year': '2016'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:_FxGoFyzp5QC',
'source': 'citations'},
{'bib': {'cites': '21',
'title': 'The NoXi database: multimodal recordings of mediated '
'novice-expert interactions',
'year': '2017'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:0EnyYjriUFMC',
'source': 'citations'},
{'bib': {'cites': '11',
'title': 'Automatic habitat classification using image analysis and '
'random forest',
'year': '2014'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:qjMakFHDy7sC',
'source': 'citations'},
{'bib': {'cites': '10',
'title': 'AutoRoot: open-source software employing a novel image '
'analysis approach to support fully-automated plant '
'phenotyping',
'year': '2017'},
'filled': False,
'id_citations': 'ZhUEBpsAAAAJ:hqOjcs7Dif8C',
'source': 'citations'}
How can I collect the bib and specifically title for papers which are present for three or more out of the four authors?
EDIT: it has in fact been pointed out that id_citations is not unique for each paper, my mistake. Better to just use the title itself.
Expanding on my comment, you can achieve this using Pandas groupby:
import pandas as pd
from scholarly import scholarly

AuthorList = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']

frames = []
for Author in AuthorList:
    search_query = scholarly.search_author(Author)
    author = next(search_query).fill()
    # creating DataFrame with this author's publications
    df = pd.DataFrame([x.__dict__ for x in author.publications])
    df['author'] = Author
    frames.append(df.copy())

# joining all author DataFrames (ignore_index avoids duplicate row labels)
df = pd.concat(frames, axis=0, ignore_index=True)
# taking bib dict into separate columns
df[['title', 'cites', 'year']] = pd.DataFrame(df.bib.to_list())
# counting unique authors attached to each title
n_authors = df.groupby('title').author.nunique()
# locating the unique titles for all publications with n_authors >= 2
output = n_authors[n_authors >= 2].index
This finds 202 papers which have 2 or more of the authors in that list (out of 774 total papers). Here is an example of the output:
Index(['1, 1′-Homodisubstituted ferrocenes containing adenine and thymine nucleobases: synthesis, electrochemistry, and formation of H-bonded arrays',
'722: Iron chelation by biopolymers for an anti-cancer therapy; binding up the'ferrotoxicity'in the colon',
'A Luminescent One-Dimensional Copper (I) Polymer',
'A Unidirectional Energy Transfer Cascade Process in a Ruthenium Junction Self-Assembled by r-and-Cyclodextrins',
'A Zinc(II)-Cyclen Complex Attached to an Anthraquinone Moiety that Acts as a Redox-Active Nucleobase Receptor in Aqueous Solution',
'A ditopic ferrocene receptor for anions and cations that functions as a chromogenic molecular switch',
'A ferrocene nucleic acid oligomer as an organometallic structural mimic of DNA',
'A heterodifunctionalised ferrocene derivative that self-assembles in solution through complementary hydrogen-bonding interactions',
'A locking X-ray window shutter and collimator coupling to comply with the new Health and Safety at Work Act',
'A luminescent europium hairpin for DNA photosensing in the visible, based on trimetallic bis-intercalators',
...
'Up-Conversion Device Based on Quantum Dots With High-Conversion Efficiency Over 6%',
'Vectorial Control of Energy‐Transfer Processes in Metallocyclodextrin Heterometallic Assemblies',
'Verteporfin selectively kills hypoxic glioma cells through iron-binding and increased production of reactive oxygen species',
'Vibrational Absorption from Oxygen-Hydrogen (Oi-H2) Complexes in Hydrogenated CZ Silicon',
'Virginia review of sociology',
'Wildlife use of log landings in the White Mountain National Forest',
'Yttrium 1995',
'ZUSCHRIFTEN-Redox-Switched Control of Binding Strength in Hydrogen-Bonded Metallocene Complexes Stichworter: Carbonsauren. Elektrochemie. Metallocene. Redoxchemie …',
'[2] Rotaxanes comprising a macrocylic Hamilton receptor obtained using active template synthesis: synthesis and guest complexation',
'pH-controlled delivery of luminescent europium coated nanoparticles into platelets'],
dtype='object', name='title', length=202)
Since all of the data is in Pandas, you can also explore which authors are attached to each of the papers, as well as all of the other information you have access to within the author.publications array coming from scholarly.
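To match the original requirement of three or more authors, the same groupby result just needs a stricter threshold. A small follow-up sketch, reusing the df and n_authors variables from the code above:

# titles shared by at least three of the four authors
shared_titles = n_authors[n_authors >= 3].index

# for each such title, list which of the four authors it appeared under
attached = (df[df.title.isin(shared_titles)]
            .groupby('title')
            .author
            .apply(lambda s: sorted(s.unique())))
print(attached.head())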
First, let's convert this into a more friendly format. You say that the id_citations is unique for each paper, so we'll use it as a hashtable/dict key.
We can then map each id_citation to the bib dict and author(s) it appears for, as a list of tuples (bib, author_name).
author_list = ['Zoe Pikramenou', 'James H. R. Tucker', 'Alison Rodger', 'Timothy Dafforn']

bibs = {}
for author_name in author_list:
    search_query = scholarly.search_author(author_name)
    author = next(search_query).fill()
    for bib in author.publications:
        bibs.setdefault(bib.id_citations, []).append((bib, author_name))
Thereafter, we can sort the keys in bibs based on how many authors are attached to them:
most_cited = sorted(bibs.items(), key=lambda k: len(k[1]))
# most_cited is now a list of tuples (key, value)
# which maps to (id_citation, [(bib1, author1), (bib2, author2), ...])
and/or filter that list down to citations that appear for three or more of the authors:
cited_enough = [tup[1][0][0] for tup in most_cited if len(tup[1]) >= 3]
# using key [0] in the middle is arbitrary. It can be anything in the
# list, provided the bib objects are identical, but index 0 is guaranteed
# to be there.
# otherwise, the first index is to grab the list rather than the id_citation,
# and the last index is to grab the bib, rather than the author_name
and now we can retrieve the titles of the papers from there:
paper_titles = [bib.bib['title'] for bib in cited_enough]
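Given the later edit that id_citations is in fact not unique per paper, the same bookkeeping can key on the title instead; a hedged variation of the loop above (same assumptions about the scholarly publication objects):

bibs_by_title = {}
for author_name in author_list:
    search_query = scholarly.search_author(author_name)
    author = next(search_query).fill()
    for bib in author.publications:
        # key on the title itself, per the question's edit
        bibs_by_title.setdefault(bib.bib['title'], []).append((bib, author_name))

# titles attached to three or more distinct authors
paper_titles = [title for title, entries in bibs_by_title.items()
                if len({name for _, name in entries}) >= 3]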
import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
#In this example 2, it is almost the same; however, some of the names were already encapsulated
# under the ((PERS)name) structure, and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2
for i in result_list:
    input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
                        lambda m: (f"((PERS){m[1]})"),
                        input_text)
print(repr(input_text)) # --> output
Note that the names must meet certain conditions to be identified: they must be surrounded by whitespace (\s*the searched name\s*), or be at the beginning ((?:(?<=\s)|^)) and/or at the end of the input string.
It may also be the case that a name is followed by a comma, for example "Ada White, Melissa and Louis went shopping", or that spaces are accidentally omitted, as in "Ada White,Melissa and Louis went shopping".
For this reason it is important that a name directly following [.,;] can still be found.
Cases where the names should NOT be encapsulated would be, for example:
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
since in these cases the name is preceded or followed by another word that should not be part of the name being searched for.
For examples 1 and 2 (note that example 2 is the same as example 1 but already has some encapsulated names and you have to prevent them from being encapsulated again), you should get the following output.
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melisa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
You can use lookarounds to exclude already encapsulated names and those followed by ', an alphanumeric character or -:
import re
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(result_list)})(?!['\w)-])")
input_text = re.sub(pat, r'((PERS)\1)', input_text)
Output:
((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.
Of course you can refine the content of your lookahead based on further edge cases.
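One such refinement, sketched here as an illustration rather than as part of the original answer: the question also requires that a name be preceded by whitespace or the start of the string, so a match glued to the end of another word or a hyphen should be rejected. A second lookbehind can enforce that:

import re

result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas',
               'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada',
               'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True)

# same pattern as above, plus (?<![\w-]) so a name stuck to the end of
# another word (e.g. the made-up "ex-White") is not encapsulated
pat = re.compile(rf"(?<!\(PERS\))(?<![\w-])({'|'.join(result_list)})(?!['\w)-])")

text = "ex-White met ((PERS)Melissa) and White near the park."
print(re.sub(pat, r'((PERS)\1)', text))
# -> ex-White met ((PERS)Melissa) and ((PERS)White) near the park.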
So I have a list of dictionaries. Here are some of the entries in the list that I'm trying to search through.
[{
'title': '"Adult" Pimiento Cheese ',
'categories': [
'Cheese',
'Vegetable',
'No-Cook',
'Vegetarian',
'Quick & Easy',
'Cheddar',
'Hot Pepper',
'Winter',
'Gourmet',
'Alabama',
],
'ingredients': [
'2 or 3 large garlic cloves',
'a 2-ounce jar diced pimientos',
'3 cups coarsely grated sharp Cheddar (preferably English, Canadian, or Vermont; about 12 ounces)'
,
'1/3 to 1/2 cup mayonnaise',
'crackers',
'toasted baguette slices',
"crudit\u00e9s",
],
'directions': ['Force garlic through a garlic press into a large bowl and stir in pimientos with liquid in jar. Add Cheddar and toss mixture to combine well. Stir in mayonnaise to taste and season with freshly ground black pepper. Cheese spread may be made 1 day ahead and chilled, covered. Bring spread to room temperature before serving.'
, 'Serve spread with accompaniments.'],
'rating': 3.125,
}, {
'title': '"Blanketed" Eggplant ',
'categories': [
'Tomato',
'Vegetable',
'Appetizer',
'Side',
'Vegetarian',
'Eggplant',
'Pan-Fry',
'Vegan',
"Bon App\u00e9tit",
],
'ingredients': [
'8 small Japanese eggplants, peeled',
'16 large fresh mint leaves',
'4 large garlic cloves, 2 slivered, 2 flattened',
'2 cups olive oil (for deep frying)',
'2 pounds tomatoes',
'7 tablespoons extra-virgin olive oil',
'1 medium onion, chopped',
'6 fresh basil leaves',
'1 tablespoon dried oregano',
'1 1/2 tablespoons drained capers',
],
'directions': ['Place eggplants on double thickness of paper towels. Salt generously. Let stand 1 hour. Pat dry with paper towels. Cut 2 deep incisions in each eggplant. Using tip of knife, push 1 mint leaf and 1 garlic sliver into each incision.'
,
"Pour 2 cups oil into heavy medium saucepan and heat to 375\u00b0F. Add eggplants in batches and fry until deep golden brown, turning occasionally, about 4 minutes. Transfer eggplants to paper towels and drain."
,
'Blanch tomatoes in pot of boiling water for 20 seconds. Drain. Peel tomatoes. Cut tomatoes in half; squeeze out seeds. Chop tomatoes; set aside.'
,
"Heat 4 tablespoons extra-virgin olive oil in large pot over high heat. Add 2 flattened garlic cloves; saut\u00e9 until light brown, about 3 minutes. Discard garlic. Add onion; saut\u00e9 until translucent, about 5 minutes. Add reduced to 3 cups, stirring occasionally, about 20 minutes."
,
'Mix capers and 3 tablespoons extra-virgin olive oil into sauce. Season with salt and pepper. Reduce heat. Add eggplants. Simmer 5 minutes, spooning sauce over eggplants occasionally. Spoon sauce onto platter. Top with eggplants. Serve warm or at room temperature.'
],
'rating': 3.75,
'calories': 1386.0,
'protein': 9.0,
'fat': 133.0,
}]
I currently have the following code that searches through the list and creates a list of recipes containing all the words in the query argument.
Below is the function that finds the matching recipes and returns them as a list of dictionaries. tokenisation is another function that removes all punctuation and digits from the query and makes it lower case; it returns a list of each word found in the query.
For example, the query "cheese!22banana" would be turned to [cheese, banana].
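For reference, since the tokenisation function itself isn't shown, a minimal sketch matching that description might look like this (an assumed implementation, not the asker's actual code):

import re

def tokenisation(text):
    # lower-case, drop punctuation and digits, return the remaining words
    return re.findall(r'[a-z]+', str(text).lower())

print(tokenisation('cheese!22banana'))   # ['cheese', 'banana']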
def matching(query):
    #split up the input string and have a list to put the recipes in
    token_list = tokenisation(query)
    matching_recipes = []
    #loop through whole file
    for recipe in recipes:
        recipe_tokens = []
        #check each key
        for key in recipe:
            #checking the keys for types
            if type(recipe[key]) != list:
                continue
            #look at the values for each key
            for sentence in recipe[key]:
                #make a big list of tokens from the keys
                recipe_tokens.extend([t for t in tokenisation(sentence)])
        #checking if all the query tokens appear in the recipe, if so append them
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
The issue I am having is that the first key in the dictionary isn't a list, so the function never checks whether the words appear in the title: it only checks the other keys, adds each of their words to a list, and then checks whether every word in the query is present in that list. Because the title is skipped, a recipe whose title contains a query word won't be returned.
How would I add this title check to the code? I've tried turning the title into a list (it currently has type string), but then I get a 'float' is not iterable error and have no clue how to tackle this issue.
To avoid the error, simply replace the
if type(recipe[key]) != list:
with
if type(recipe[key]) == str:
Or better,
if isinstance(value, str):
You get the error from trying to use the tokenisation function on certain values, because there are values in the dicts that are indeed of type float, for example, the value of the 'rating' key.
If the tokenization function returns a list of sentences, this should work:
def matching(query):
    token_list = tokenisation(query)
    matching_recipes = []
    for recipe in recipes:
        recipe_tokens = []
        for value in recipe.values():
            if isinstance(value, str):
                recipe_tokens.append(value)
                continue
            if not isinstance(value, list):
                # skip non-iterable values such as the float 'rating'
                continue
            for sentence in value:
                recipe_tokens.extend(tokenisation(sentence))
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
If it returns a list of words:
def matching(query):
    token_list = tokenisation(query)
    matching_recipes = []
    for recipe in recipes:
        recipe_tokens = []
        for value in recipe.values():
            if isinstance(value, str):
                value = tokenisation(value)
            elif not isinstance(value, list):
                # skip non-iterable values such as the float 'rating'
                continue
            for sentence in value:
                recipe_tokens.extend(tokenisation(sentence))
        if all([tl in recipe_tokens for tl in token_list]):
            matching_recipes.append(recipe)
    return matching_recipes
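A quick usage check with the sample data, assuming recipes is the list of dictionaries shown above and tokenisation behaves as described:

# 'blanketed' only occurs in the second recipe's title, so this only
# returns a hit because titles are now tokenised as well
for recipe in matching('blanketed'):
    print(recipe['title'])
# -> "Blanketed" Eggplant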
I am trying to scrape some data from this books site. I need to extract the title and the author(s). I was able to extract the titles without much trouble. However, I am having issues extracting the authors when there is more than one, since they appear on the same line and belong to separate anchor tags within an h4 header.
<h4>
"5
. "
The Elements of Style
" by "
William Strunk, Jr
", "
E. B. White
</h4>
This is what I tried:
book_container = soup.find_all('li', class_='item pb-3 pt-3 border-bottom')

for container in book_container:
    # title
    title = container.h4.a.text
    titles.append(title)
    # author(s)
    author_s = container.h4.find_all('a')
    print('### SECOND FOR LOOP ###')
    for a in author_s:
        if a['href'].startswith('/authors/'):
            print(a.text)
I'd like to have two authors in a tuple.
You can extract all the <a> links under each <h4> (the h4 tag is where the title and authors are). The first <a> tag is the title; the rest of the <a> tags are the authors:
import requests
from bs4 import BeautifulSoup

url = 'https://thegreatestbooks.org/the-greatest-nonfiction-since/1900'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for item in soup.select('h4:has(>a)'):
    elements = [i.get_text(strip=True) for i in item.select('a')]
    title = elements[0]
    authors = elements[1:]
    print('{:<40} {}'.format(title, authors))
Prints:
The Diary of a Young Girl ['Anne Frank']
The Autobiography of Malcolm X ['Alex Haley']
Silent Spring ['Rachel Carson']
In Cold Blood ['Truman Capote']
The Elements of Style ['William Strunk, Jr', 'E. B. White']
The Double Helix: A Personal Account of the Discovery of the Structure of DNA ['James D. Watson']
Relativity ['Albert Einstein']
Look Homeward, Angel ['Thomas Wolfe']
Homage to Catalonia ['George Orwell']
Speak, Memory ['Vladimir Nabokov']
The General Theory of Employment, Interest and Money ['John Maynard Keynes']
The Second World War ['Winston Churchill']
The Education of Henry Adams ['Henry Adams']
Out of Africa ['Isak Dinesen']
The Structure of Scientific Revolutions ['Thomas Kuhn']
Dispatches ['Michael Herr']
The Gulag Archipelago ['Aleksandr Solzhenitsyn']
I Know Why the Caged Bird Sings ['Maya Angelou']
The Civil War ['Shelby Foote']
If This Is a Man ['Primo Levi']
Collected Essays of George Orwell ['George Orwell']
The Electric Kool-Aid Acid Test ['Tom Wolfe']
Civilization and Its Discontents ['Sigmund Freud']
The Death and Life of Great American Cities ['Jane Jacobs']
Selected Essays of T. S. Eliot ['T. S. Eliot']
A Room of One's Own ['Virginia Woolf']
The Right Stuff ['Tom Wolfe']
The Road to Serfdom ['Friedrich von Hayek']
R. E. Lee ['Douglas Southall Freeman']
The Varieties of Religious Experience ['Will James']
The Liberal Imagination ['Lionel Trilling']
Angela's Ashes: A Memoir ['Frank McCourt']
The Second Sex ['Simone de Beauvoir']
Mere Christianity ['C. S. Lewis']
Moveable Feast ['Ernest Hemingway']
The Autobiography of Alice B. Toklas ['Gertrude Stein']
The Origins of Totalitarianism ['Hannah Arendt']
Black Lamb and Grey Falcon ['Rebecca West']
Orthodoxy ['G. K. Chesterton']
Philosophical Investigations ['Ludwig Wittgenstein']
Night ['Elie Wiesel']
The Affluent Society ['John Kenneth Galbraith']
Mythology ['Edith Hamilton']
The Open Society ['Karl Popper']
The Color of Water: A Black Man's Tribute to His White Mother ['James McBride']
The Seven Storey Mountain ['Thomas Merton']
Hiroshima ['John Hersey']
Let Us Now Praise Famous Men ['James Agee']
Pragmatism ['Will James']
The Making of the Atomic Bomb ['Richard Rhodes']
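Since the question asked for the authors as a tuple, the slice can simply be wrapped in tuple(); a small tweak of the loop above:

for item in soup.select('h4:has(>a)'):
    elements = [i.get_text(strip=True) for i in item.select('a')]
    title, authors = elements[0], tuple(elements[1:])
    print(title, authors)
    # e.g.: The Elements of Style ('William Strunk, Jr', 'E. B. White')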
This might not be the most pythonic way, but it's a workaround.
newlist = []
for a in author_s:
    if a['href'].startswith('/authors/'):
        if len(author_s) > 2:
            newlist.append(a.text)
            print(tuple(newlist))
        else:
            print(a.text)
I'm using the fact that the variable author_s contains a list we can check for more names: more than 2 entries in the list means there is more than one author. (Alternatively, you could also check for the existence of a newline in the printed output.)
You will also notice the printed output has two tuples; always extract the second one. Entries with a single author will remain the same. Since this page does not have multiple lines with two authors, I couldn't check for complications.
Output:
[The Elements of Style, William Strunk, Jr, E. B. White]
### SECOND FOR LOOP ###
('William Strunk, Jr',)
('William Strunk, Jr', 'E. B. White')
I would like to grab some text from a webpage of a medical document for a Natural Language Processing project and am having issues extracting the necessary information using BeautifulSoup. The website I am viewing can be found at the address: https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=2332-Abdominal%20Abscess%20I&D
What I would like to do is grab the entire text body from this page; selecting it with my cursor and simply copy/pasting gives me the text I am interested in:
Sample Type / Medical Specialty: Gastroenterology
Sample Name: Abdominal Abscess I&D
Description: Incision and drainage (I&D) of abdominal abscess, excisional debridement of nonviable and viable skin, subcutaneous tissue and muscle, then removal of foreign body.
(Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS: Abdominal wall abscess.
... (body text) ...
The finished wound size was 9.0 x 5.3 x 5.2 cm in size. Patient tolerated the procedure well. Dressing was applied, and he was taken to recovery room in stable condition.
However, I would like to implement this using BeautifulSoup because I would like to perform a loop to grab multiple medical documents from the same website.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=2332-Abdominal%20Abscess%20I&D')
soup = BeautifulSoup(r.text, 'html.parser')

results = soup.find_all('div', attrs={'id': 'sampletext'})
record = results[0]

# Here I am able to specify the <h1> tag to get 'Sample Type / Medical Specialty' as well as 'Sample Name' text fields
record.find('h1').text.replace('\n', ' ')
However, I cannot replicate this for the remaining text (i.e. Description, PREOPERATIVE DIAGNOSIS, POSTOPERATIVE DIAGNOSIS, Procedure, etc.), as there are no unique tags to identify these text fields.
If anyone is familiar with web-scraping concepts using BeautifulSoup I would appreciate any feedback! Again my goal is to obtain the full text from the webpage which I would ultimately like to add to a Pandas Dataframe. Thanks!
Ok, it took me a while, but there isn't an easy way of extracting usable text unless you manually iterate over all elements:
import requests
import re
from bs4 import BeautifulSoup, Tag, NavigableString, Comment
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=2332-Abdominal%20Abscess%20I&D'
res = requests.get(url)
res.raise_for_status()
html = res.text
soup = BeautifulSoup(html, 'html.parser')
so far nothing special.
title_el = soup.find('h1')
page_title = title_el.text.strip()

first_hr = title_el.find_next_sibling('hr')

description_title = title_el.find_next_sibling('b', text=re.compile('description', flags=re.I))
description_text_parts = []
for s in description_title.next_siblings:
    if s is first_hr:
        break
    if isinstance(s, Tag):
        description_text_parts.append(s.text.strip())
    elif isinstance(s, NavigableString):
        description_text_parts.append(str(s).strip())
description_text = '\n'.join(p for p in description_text_parts if p.strip())
Here we get page_title from the <h1>:
'Sample Type / Medical Specialty: Gastroenterology\nSample Name: Abdominal Abscess I&D'
and the description by walking the elements after we see the text Description:
'Incision and drainage (I&D) of abdominal abscess, excisional debridement of nonviable and viable skin, subcutaneous tissue and muscle, then removal of foreign body.\n(Medical Transcription Sample Report)'
Now, all titles are placed under the horizontal rule:
# titles are all bold and uppercase
titles = [b for b in first_hr.find_next_siblings('b') if b.text.strip().isupper()]
We then find the text between consecutive titles and assign it to the title that precedes it:
docs = []
for t in titles:
    text_parts = []
    for s in t.next_siblings:
        # go until next title
        if s in titles:
            break
        if isinstance(s, Comment):
            continue
        if isinstance(s, Tag):
            if s.name == 'div':
                break
            text_parts.append(s.text.strip())
        elif isinstance(s, NavigableString):
            text_parts.append(str(s).strip())
    text = '\n'.join(p for p in text_parts if p.strip())
    docs.append({
        'title': t.text.strip(),
        'text': text
    })
printing docs gives:
[
{'title': 'PREOPERATIVE DIAGNOSIS:', 'text': 'Abdominal wall abscess.'},
{'title': 'POSTOPERATIVE DIAGNOSIS:', 'text': 'Abdominal wall abscess.'},
{'title': 'PROCEDURE:', 'text': 'Incision and drainage (I&D) of abdominal abscess, excisional debridement of nonviable and viable skin, subcutaneous tissue and muscle, then removal of foreign body.'},
{'title': 'ANESTHESIA:', 'text': 'LMA.'},
{'title': 'INDICATIONS:', 'text': 'Patient is a pleasant 60-year-old gentleman, who initially had a sigmoid colectomy for diverticular abscess, subsequently had a dehiscence with evisceration. Came in approximately 36 hours ago with pain across his lower abdomen. CT scan demonstrated presence of an abscess beneath the incision. I recommended to the patient he undergo the above-named procedure. Procedure, purpose, risks, expected benefits, potential complications, alternatives forms of therapy were discussed with him, and he was agreeable to surgery.'},
{'title': 'FINDINGS:', 'text': 'The patient was found to have an abscess that went down to the level of the fascia. The anterior layer of the fascia was fibrinous and some portions necrotic. This was excisionally debrided using the Bovie cautery, and there were multiple pieces of suture within the wound and these were removed as well.'},
{'title': 'TECHNIQUE:', 'text': 'Patient was identified, then taken into the operating room, where after induction of appropriate anesthesia, his abdomen was prepped with Betadine solution and draped in a sterile fashion. The wound opening where it was draining was explored using a curette. The extent of the wound marked with a marking pen and using the Bovie cautery, the abscess was opened and drained. I then noted that there was a significant amount of undermining. These margins were marked with a marking pen, excised with Bovie cautery; the curette was used to remove the necrotic fascia. The wound was irrigated; cultures sent prior to irrigation and after achievement of excellent hemostasis, the wound was packed with antibiotic-soaked gauze. A dressing was applied. The finished wound size was 9.0 x 5.3 x 5.2 cm in size. Patient tolerated the procedure well. Dressing was applied, and he was taken to recovery room in stable condition.'}
]
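Since the goal was to end up with a Pandas DataFrame, the collected pieces can be assembled per page, for example (a hedged sketch, assuming the page_title, description_text and docs variables from above):

import pandas as pd

# one row per section of this document, with the page-level fields repeated
df = pd.DataFrame(docs)
df['page_title'] = page_title
df['description'] = description_text

# repeating this inside a loop over several sample URLs and concatenating
# the per-page frames would give one table for the whole corpus
print(df[['page_title', 'title', 'text']].head())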
I'm writing a spider trulia to scrape pages of properties for sale on Trulia.com such as https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123; the current version can be found on https://github.com/khpeek/trulia-scraper.
I'm using Item Loaders and invoking the add_xpath method with the re keyword argument to specify regular expressions to extract. In the example in the documentation, there is just one group in the regular expression and one field to extract to.
However, I would actually like to define two groups and extract them to two separate Scrapy fields. Here is an 'excerpt' from the parse_property_page method:
def parse_property_page(self, response):
    l = TruliaItemLoader(item=TruliaItem(), response=response)
    details = l.nested_css('.homeDetailsHeading')
    overview = details.nested_xpath('.//span[contains(text(), "Overview")]/parent::div/following-sibling::div[1]')
    overview.add_xpath('overview', xpath='.//li/text()')
    overview.add_xpath('area', xpath='.//li/text()', re=r'([\d,]+) sqft$')
    overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (acres|sqft) lot size$')
Notice how the lot_size field has two groups extracted: one for the number, and one for the units which can be either 'acres' or 'sqft'. If I run this parse method using the command
scrapy parse https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123 --spider=trulia --callback=parse_property_page
then I get the following scraped item:
# Scraped Items ------------------------------------------------------------
[{'address': '1860 Lombard St',
'area': 2524.0,
'city_state': 'San Francisco, CA 94123',
'dates': ['10/22/2002', '04/25/2002', '03/20/2000'],
'description': ['Outstanding investment opportunity to own this light-fixer '
'mixed use Marina 2-unit property w/established income and '
'not on liquefaction. The first floor of this building '
'houses a commercial business currently leased to Jigalin '
'Fitness until 2018. The second floor presents a 2bed/1bath '
'apartment fully outfitted in a contemporary design w/full '
'kitchen, 10ft high ceilings & laundry area. The apartment '
'will be delivered vacant. The structure has undergone '
'renovation & features concrete perimeter foundation, '
'reinforced walls, ADA compliant commercial restroom, '
'electrical updates & rolling door. This property makes an '
"ideal investment with instant cash flow. Don't let this "
'pass you by. As-Is sale.'],
'events': ['Sold', 'Sold', 'Sold'],
'listing_information': ['2 Bedrooms', 'Multi-Family'],
'listing_information_date_updated': '11/03/2017',
'lot_size': ['1620', 'sqft'],
'neighborhood': 'Marina',
'overview': ['Multi-Family',
'2 Beds',
'Built in 1908',
'1 days on Trulia',
'1620 sqft lot size',
'2,524 sqft',
'$711/sqft'],
'prices': ['$850,000', '$1,350,000', '$1,200,000'],
'public_records': ['1 Bathroom',
'Multi-Family',
'1,296 Square Feet',
'Lot Size: 1,620 sqft'],
'public_records_date_updated': '07/01/2017',
'url': 'https://www.trulia.com/property/1072559047-1860-Lombard-St-San-Francisco-CA-94123'}]
where the lot_size field is a list with the number and the unit. However, I'd ideally like to extract the unit (acres or sqft) to a separate field lot_size_units. I could do this by first loading the item and doing my own processing, but I was wondering whether there is a more Scrapy-native way to 'unpack' the matched groups into different items?
(I've perused the get_value method on https://github.com/scrapy/scrapy/blob/129421c7e31b89b9b0f9c5f7d8ae59e47df36091/scrapy/loader/__init__.py, but this hasn't 'shown me the way' yet if there is any).
You could try this (ignoring one group at a time):
overview.add_xpath('lot_size', xpath='.//li/text()', re=r'([\d,]+) (?:acres|sqft) lot size$')
overview.add_xpath('lot_size_units', xpath='.//li/text()', re=r'(?:[\d,]+) (acres|sqft) lot size$')
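This does run the regex twice over the same text. If you'd rather match once, one possibility (an illustrative sketch, not a dedicated loader feature) is to apply the regex on the nested selector yourself and feed each captured group in with add_value:

# inside parse_property_page, reusing the `overview` nested loader from the question
matches = overview.selector.xpath('.//li/text()').re(r'([\d,]+) (acres|sqft) lot size$')
if matches:
    # .re() returns the captured groups flattened, here [size, units]
    overview.add_value('lot_size', matches[0])
    overview.add_value('lot_size_units', matches[1])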