I am making a web scraper for Book Depository and I came across a problem with the HTML elements of the site. The page for a book has a section called Product Details and I need to take each element from that list. However, some of the elements (not all of them), like Language, have this structure:
sample image. How is it possible to get this element?
My work in progress is below. Thanks a lot in advance.
import bs4
from urllib.request import urlopen
book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_description = soup.find('div', class_='item-excerpt trunc')
book_title = soup.find('h1').text
book_info = soup.find('ul', class_='biblio-info')
book_pages = book_info.find('span', itemprop='numberOfPages').text
book_ibsn = book_info.find('span', itemprop='isbn').text
book_publication_date = book_info.find('span', itemprop='datePublished').text
book_publisher = book_info.find('span', itemprop='name').text
book_author = soup.find('span', itemprop="author").text
book_cover = soup.find('div', class_='item-img-content').img
book_language = book_info.find_next(string='Language',)
book_format = book_info.find_all(string='Format', )
print('Number of Pages: ' + book_pages.strip())
print('ISBN Number: ' + book_ibsn)
print('Publication Date: ' + book_publication_date)
print('Publisher Name: ' + book_publisher.strip())
print('Author: '+ book_author.strip())
print(book_cover)
print(book_language)
print(book_format)
To get the corresponding <span> to your label you could go with:
book_info.find_next(string='Language').find_next('span').get_text(strip=True)
A more generic approach to get all these product details could be:
import bs4, re
from urllib.request import urlopen
book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book = {
    'description': soup.find('div', class_='item-excerpt trunc').get_text(strip=True),
    'title': soup.find('h1').text
}
book.update({e.label.text.strip(): re.sub(r'\s+', ' ', e.span.text).strip() for e in soup.select('.biblio-info li')})
book
Output:
{'description': "'A breathtaking memoir...I was so moved by this book.' Oprah'It is startlingly honest and, at times, a jaw-dropping read, charting her rise from poverty and abuse to becoming the first African-American to win the triple crown of an Oscar, Emmy and Tony for acting.' BBC NewsTHE DEEPLY PERSONAL, BRUTALLY HONEST ACCOUNT OF VIOLA'S INSPIRING LIFEIn my book, you will meet a little girl named Viola who ran from her past until she made a life changing decision to stop running forever.This is my story, from a crumbling apartment in Central Falls, Rhode Island, to the stage in New York City, and beyond. This is the path I took to finding my purpose and my strength, but also to finding my voice in a world that didn't always see me.As I wrote Finding Me, my eyes were open to the truth of how our stories are often not given close examination. They are bogarted, reinvented to fit into a crazy, competitive, judgmental world. So I wrote this for anyone who is searching for a way to understand and overcome a complicated past, let go of shame, and find acceptance. For anyone who needs reminding that a life worth living can only be born from radical honesty and the courage to shed facades and be...you.Finding Me is a deep reflection on my past and a promise for my future. My hope is that my story will inspire you to light up your own life with creative expression and rediscover who you were before the world put a label on you.show more",
'title': 'Finding Me : A Memoir - THE INSTANT SUNDAY TIMES BESTSELLER',
'Format': 'Hardback | 304 pages',
'Dimensions': '160 x 238 x 38mm | 520g',
'Publication date': '26 Apr 2022',
'Publisher': 'Hodder & Stoughton',
'Imprint': 'Coronet Books',
'Publication City/Country': 'London, United Kingdom',
'Language': 'English',
'ISBN10': '1399703994',
'ISBN13': '9781399703994',
'Bestsellers rank': '31'}
You can check if the label text is equal to Language and then print the span text. I have also added a better approach that parses the whole product details section in a single iteration.
Check the code given below:
import bs4
from urllib.request import urlopen
import re
book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_info = soup.find('ul', class_='biblio-info')
lis=book_info.find_all('li')
# Check if the label name is Language and then print the span text
for val in lis:
    label = val.find('label')
    if label.text.strip() == 'Language':
        span = val.find('span')
        span_text = span.text.strip()
        print('Language--> ' + span_text)

# A better approach: get all the name/value pairs of the Product details section in a single iteration
for val in lis:
    label = val.find('label')
    span = val.find('span')
    span_text = span.text.strip()
    modified_text = re.sub(r'\n', ' ', span_text)
    modified_text = re.sub(' +', ' ', modified_text)
    print(label.text.strip() + '--> ' + modified_text)
You can grab the desired data from the details portion using CSS selectors:
import bs4
from urllib.request import urlopen
import re
book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
#print(book_urls)
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_description = soup.find('div', class_='item-excerpt trunc')
book_title = soup.find('h1').text
book_info = soup.find('ul', class_='biblio-info')
book_pages = book_info.find('span', itemprop='numberOfPages').text
book_ibsn = book_info.find('span', itemprop='isbn').text
book_publication_date = book_info.find('span', itemprop='datePublished').text
book_publisher = book_info.find('span', itemprop='name').text
book_author = soup.find('span', itemprop="author").text
book_cover = soup.find('div', class_='item-img-content').img.get('src')
book_language =soup.select_one('.biblio-info > li:nth-child(7) span').get_text(strip=True)
book_format = soup.select_one('.biblio-info > li:nth-child(1) span').get_text(strip=True)
book_format = re.sub(r'\s+', ' ',book_format).replace('|','')
print('Number of Pages: ' + book_pages.strip())
print('ISBN Number: ' + book_ibsn)
print('Publication Date: ' + book_publication_date)
print('Publisher Name: ' + book_publisher.strip())
print('Author: '+ book_author.strip())
print(book_cover)
print(book_language)
print(book_format)
Output:
Number of Pages: 304 pages
ISBN Number: 9781399703994
Publication Date: 26 Apr 2022
Publisher Name: Hodder & Stoughton
Author: Viola Davis
https://d1w7fb2mkkr3kw.cloudfront.net/assets/images/book/lrg/9781/3997/9781399703994.jpg
English
Hardback 304 pages
I want to prepare a dataframe of universities, their abbreviations and website links.
My code:
import requests
import pandas as pd

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)
Current result:
ValueError: No tables found
Expected result:
df =
| | university_full_name | uni_abb | uni_url|
---------------------------------------------------------------------
| 0 | Albert Einstein College of Medicine | AECOM | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine|
That's one funky page you have there...
First, there are indeed no tables in there. Second, some organizations don't have links, others have redirect links and still others use the same abbreviation for more than one organization.
So you need to bring in the heavy artillery: xpath...
import pandas as pd
import requests
from lxml import html as lh
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
doc = lh.fromstring(response.text)
rows = []
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):
    info = uni.text.split(' – ')
    abb = info[0]
    # for those w/ no links
    if not uni.xpath('.//a'):
        rows.append((abb, " ", info[1]))
    # now to account for those using the same abbreviation for multiple organizations
    for a in uni.xpath('.//a'):
        dat = a.xpath('./@*')
        # for those with redirects
        if len(dat) == 3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb, link, dat[1]))
#and now, at last, to the dataframe
cols = ['abb','url','full name']
df = pd.DataFrame(rows,columns=cols)
df
Output:
abb url full name
0 AECOM https://en.wikipedia.org/wiki/Albert_Einstein_... Albert Einstein College of Medicine
1 AFA https://en.wikipedia.org/wiki/United_States_Ai... United States Air Force Academy
etc.
Note: you can rearrange the order of columns in the dataframe, if you are so inclined.
Select and iterate over only the expected <li> elements and extract their information, but be aware there is a university without an <a> (SUI – State University of Iowa), so this should be handled with an if-statement, as in the example:
for e in soup.select('h2 + ul li'):
    data.append({
        'abb': e.text.split('-')[0],
        'full_name': e.text.split('-')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.text)
data = []
for e in soup.select('h2 + ul li'):
    data.append({
        'abb': e.text.split('-')[0],
        'full_name': e.text.split('-')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })
pd.DataFrame(data)
Output:
   abb                            full_name                                         url
0  AECOM                          Albert Einstein College of Medicine               https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine
1  AFA                            United States Air Force Academy                   https://en.wikipedia.org/wiki/United_States_Air_Force_Academy
2  Annapolis                      U.S. Naval Academy                                https://en.wikipedia.org/wiki/United_States_Naval_Academy
3  A&M                            Texas A&M University, but also others; see A&M    https://en.wikipedia.org/wiki/Texas_A%26M_University
4  A&M-CC or A&M-Corpus Christi   Corpus Christi                                    https://en.wikipedia.org/wiki/Texas_A%26M_University%E2%80%93Corpus_Christi
...
There are no tables on this page, but lists. So the goal will be to go through the <ul> and then the <li> tags, skipping the lists you are not interested in (the first and those after the 26th).
You can extract the abbreviation of the university this way:
uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
while to get the URL you have to access the 'href' and 'title' attributes inside the <a> tag:
for a in li.find_all('a', href=True):
    title = a['title']
    url = f"https://en.wikipedia.org/{a['href']}"
Accumulate the extracted information into a list, and finally create the dataframe by assigning appropriate column names.
Here is the complete code, in which I use BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup
abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
soup = BeautifulSoup(abb_html)
l = []
for ul in soup.find_all("ul")[1:26]:
    for li in ul.find_all("li"):
        uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
        for a in li.find_all('a', href=True):
            l.append((a['title'], uni_abb, f"https://en.wikipedia.org/{a['href']}"))
df = pd.DataFrame(l, columns=['university_full_name', 'uni_abb', 'uni_url'])
Result:
university_full_name uni_abb uni_url
0 Albert Einstein College of Medicine AECOM https://en.wikipedia.org//wiki/Albert_Einstein...
1 United States Air Force Academy AFA https://en.wikipedia.org//wiki/United_States_A...
I have some code that goes through the cast list of a show or movie on Wikipedia, scraping all the actors' names and storing them. The current code finds all the <a> elements in the list and stores their title attributes. It currently goes:
import requests
from bs4 import BeautifulSoup

URL = input()
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').find_all('a'):
    title = x.get('title')
    print(title)
    if title is not None:
        Stars.append(title)
    else:
        continue
While this partially works, there are two downsides:
It doesn't work if the actor doesn't have a Wikipedia page hyperlink.
It also scrapes any other hyperlink title it finds. e.g. https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull returns ['Harrison Ford', 'Indiana Jones (character)', 'Bullwhip', 'Cate Blanchett', 'Irina Spalko', 'Bob cut', 'Rosa Klebb', 'From Russia with Love (film)', 'Karen Allen', 'Marion Ravenwood', 'Ray Winstone', 'Sallah', 'List of characters in the Indiana Jones series', 'Sexy Beast', 'Hamstring', 'Double agent', 'John Hurt', 'Ben Gunn (Treasure Island)', 'Treasure Island', 'Courier', 'Jim Broadbent', 'Marcus Brody', 'Denholm Elliott', 'Shia LaBeouf', 'List of Indiana Jones characters', 'The Young Indiana Jones Chronicles', 'Frank Darabont', 'The Lost World: Jurassic Park', 'Jeff Nathanson', 'Marlon Brando', 'The Wild One', 'Holes (film)', 'Blackboard Jungle', 'Rebel Without a Cause', 'Switchblade', 'American Graffiti', 'Rotator cuff']
Is there a way I can get BeautifulSoup to scrape the first two words after each <li>? Or is there a better solution for what I am trying to do?
You can use css selectors to grab only the first <a> in a <li>:
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
Example
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull#Cast'
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
    Stars.append(x.get('title'))
Stars
Output
['Harrison Ford',
'Cate Blanchett',
'Karen Allen',
'Ray Winstone',
'John Hurt',
'Jim Broadbent',
'Shia LaBeouf']
You can use a regex to fetch all the names from the text content of each <li> and just take the first match (the first two capitalized words); this also fixes the issue where an actor doesn't have a Wikipedia page hyperlink.
import re
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)", <text_content_from_li>)
Example:
text = "Cate Blanchett as Irina Spalko, a villainous Soviet agent. Screenwriter David Koepp created the character."
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)",text)
Output:
[('Cate', 'Blanchett'), ('Irina', 'Spalko'), ('Screenwriter', 'David')]
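Applied to the Cast list from the question, that could look roughly like the sketch below. It assumes the same span#Cast / <ul> markup used in the question's code and keeps only the first regex match per <li>, which is normally the actor's name.
# Rough sketch: combine the question's section lookup with the regex, first match per <li> only
import re
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull'
soup = BeautifulSoup(requests.get(URL).text, 'lxml')
section = soup.find('span', id='Cast').parent

stars = []
for li in section.find_next('ul').find_all('li'):
    matches = re.findall(r"([A-Z][a-z]+) ([A-Z][a-z]+)", li.get_text())
    if matches:
        stars.append(' '.join(matches[0]))  # first two capitalized words, i.e. the actor
print(stars)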
There is considerable variation in the HTML for the cast section within the film listings on Wikipedia. Perhaps look to an API to get this info?
E.g. imdb8 allows for a reasonable number of calls which you could use with the following endpoint
https://imdb8.p.rapidapi.com/title/get-top-cast
There also seems to be a Python IMDb API.
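For instance, with the IMDbPY package (now published as Cinemagoer) a cast list can be pulled without any HTML parsing. This is only a sketch, using one of the film ids from the snippet below:
# Sketch using IMDbPY / Cinemagoer (pip install cinemagoer) instead of scraping HTML
from imdb import IMDb

ia = IMDb()
movie = ia.get_movie('0367882')  # numeric part of the id 'tt0367882' used below
for person in movie.get('cast', [])[:10]:  # first ten cast members
    print(person['name'])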
Or choose something with more regular HTML. For example, if you put the IMDb film ids in a list you can extract the full cast and the main actors from IMDb as follows. To get the shorter cast list I am filtering out the rows which occur at/after the text "Rest" within "Rest of cast listed alphabetically:".
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
movie_ids = ['tt0367882', 'tt7126948']
base = 'https://www.imdb.com'
with requests.Session() as s:
    for movie_id in movie_ids:
        link = f'https://www.imdb.com/title/{movie_id}/fullcredits?ref_=tt_cl_sm'
        # print(link)
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
        full_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list [href*=name]:has(img)')]
        main_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list tr:not(:has(.castlist_label:contains(cast)) ~ tr, :has(.castlist_label:contains(cast))) [href*=name]:has(img)')]
        df_full = pd.DataFrame(full_cast, columns=['Actor', 'Link'])
        df_main = pd.DataFrame(main_cast, columns=['Actor', 'Link'])
        # print(df_full)
        print(df_main)
I know very little about backend programming. I wanted to scrape data from the Delhi Fire Service for my academic project; there are online fire reports available zone-wise for Delhi, and for each zone lots of files are available.
By the way, if you go directly to this link you will get an empty page (I don't know why). Further, if I click on one file it will open like this,
and there is a pattern in the link: each time the report number changes and the rest of the link remains the same, so I obtained all the links for scraping. The problem I am facing is that when I load a link using BeautifulSoup I am not getting the same content for the report as when I load the same link in a browser.
import bs4 as bs
import urllib.request
import requests
with open("p.html",'r') as f:
page = f.read()
soup = bs.BeautifulSoup(page,'lxml')
links =soup.find_all('a')
urls=[]
for link in links:
urls.append(link.get('href'))
string1="http://delhigovt.nic.in/FireReport/"
# print(urls)
link1 = string1 + urls[1]
print(link1)
sauce = requests.get(link1)
soup = bs.BeautifulSoup(sauce.content,'lxml')
print(soup)
And it is random: sometimes if I copy the link and load it in a new tab (or another browser) it turns into an error page, so I lose the report information. I am not able to scrape the data this way even though I have all the links for all the reports. Can someone tell me what is going on? Thank you.
Update - link: http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public
There you have to select the option "No" in the right corner to be able to get the "Search" button.
To scrape the page, you need to use requests.session to set the cookies correctly. Also, there's a parameter ud in the POST request that the page uses, and it needs to be set correctly.
For example (this scrapes all stations and reports and stores them in the dictionary data):
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url = 'http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public'
post_url = 'http://delhigovt.nic.in/FireReport/a_publicSearch.asp'
params = {'ud': '',
          'fstation': '',
          'caller': '',
          'add': '',
          'frmdate': '',
          'todate': '',
          'save': 'Search'}

def open_report(s, url):
    url = 'http://delhigovt.nic.in/FireReport/' + url
    print(url)
    soup = BeautifulSoup(s.get(url).content, 'lxml')
    # just return some text here
    return soup.select('body > table')[1].get_text(strip=True, separator=' ')

data = {}
with requests.session() as s:
    soup = BeautifulSoup(s.get(url).content, 'lxml')

    stations = {}
    for option in soup.select('select[name="fstation"] option[value]:not(:contains("Select Fire Station"))'):
        stations[option.get_text(strip=True)] = option['value']

    params['ud'] = soup.select_one('input[name="ud"][value]')['value']

    for k, v in stations.items():
        print('Scraping station {} id={}'.format(k, v))
        params['fstation'] = int(v)
        soup = BeautifulSoup(s.post(post_url, data=params).content, 'lxml')
        for tr in soup.select('tr:has(> td > a[href^="f_publicReport.asp?rep_no="])'):
            no, fire_report_no, date, address = tr.select('td')
            link = fire_report_no.a['href']
            data.setdefault(k, [])
            data[k].append((no.get_text(strip=True), fire_report_no.get_text(strip=True), date.get_text(strip=True), address.get_text(strip=True), link, open_report(s, link)))
            pprint(data[k][-1])

# pprint(data)  # <-- here is your data
Prints:
Scraping station Badli id=33
http://delhigovt.nic.in/FireReport/f_publicReport.asp?rep_no=200600024&ud=6668
('1',
'200600024',
'1-Apr-2006',
'Shahbad, Daulat Pur.',
'f_publicReport.asp?rep_no=200600024&ud=6668',
'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
'Number  : 200600024 Operational Jurisdiction of Fire Station : '
'Badli Information Received From: PCR Full Address of Incident Place: '
'Shahbad, Daulat Pur. Date of Receipt of Call : Saturday, April 1, 2006 '
'Time of Receipt of Call \t : 17\xa0Hrs\xa0:\xa055\xa0Min Time of '
'Departure From Fire Station: 17\xa0Hrs\xa0:\xa056\xa0Min Approximate '
'Distance From Fire Station: 3\xa0\xa0Kilometers Time of Arrival at Fire '
'Scene: 17\xa0Hrs\xa0:\xa059\xa0Min Nature of Call Fire Date of Leaving From '
'Fire Scene: 4/1/2006 Time of Leaving From Fire Scene: 18\xa0Hrs\xa0:\xa0'
'30\xa0Min Type of Occupancy: Others Occupancy Details in Case of Others: '
'NDPL Category of Fire: Small Type of Building: Low Rise Details of Affected '
'Area: Fire was in electrical wiring. Divisional Officer Delhi Fire Service '
'Disclaimer: This is a computer generated report.\r\n'
'Neither department nor its associates, information providers or content '
'providers warrant or guarantee the timeliness, sequence, accuracy or '
'completeness of this information.')
http://delhigovt.nic.in/FireReport/f_publicReport.asp?rep_no=200600161&ud=6668
('2',
'200600161',
'5-Apr-2006',
'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi.',
'f_publicReport.asp?rep_no=200600161&ud=6668',
'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
'Number  : 200600161 Operational Jurisdiction of Fire Station : '
'Badli Information Received From: PCR Full Address of Incident Place: '
'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi. Date of '
'Receipt of Call : Wednesday, April 5, 2006 Time of Receipt of Call \t'
' : 19\xa0Hrs\xa0:\xa010\xa0Min Time of Departure From Fire Station: '
'19\xa0Hrs\xa0:\xa011\xa0Min Approximate Distance From Fire Station: '
'1.5\xa0\xa0Kilometers Time of Arrival at Fire Scene: 19\xa0Hrs\xa0:\xa013\xa0'
'Min Nature of Call Fire Date of Leaving From Fire Scene: 4/5/2006 Time of '
'Leaving From Fire Scene: 20\xa0Hrs\xa0:\xa050\xa0Min Type of Occupancy: '
'Others Occupancy Details in Case of Others: MCD Category of Fire: Small Type '
'of Building: Others Building Details in Case of Others: On Road Details of '
'Affected Area: Fire was in Rubbish and dry tree on road. Divisional Officer '
'Delhi Fire Service Disclaimer: This is a computer generated report.\r\n'
'Neither department nor its associates, information providers or content '
'providers warrant or guarantee the timeliness, sequence, accuracy or '
'completeness of this information.')
...and so on.
I am scraping this website. I have a script which scrapes the sentence that contains the relevant information.
Now what I want to do is extract the following information from the scraped sentence:
The name of the company that is hiring
The location of the company
The position that the ad is for
Job listings which do not have all three required fields will be discarded.
This is my script
from bs4 import BeautifulSoup
import requests
# scrape the given website
url = "https://news.ycombinator.com/jobs"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
table = content.find("table", attrs={"class": "itemlist"})
array = []
# now store the required data in an array
for elem in table.findAll('tr', attrs={'class': 'athing'}):
    elem_id = elem.get('id')  # the row's id attribute
    array.append({'id': elem_id,
                  'listing': elem.find('a', attrs={'class': 'storylink'}).text})
Most of the jobs seem to have the following pattern
ZeroCater (YC W11) Is Hiring a Principal Engineer in SF
^^^^^^^^^                    ^^^^^^^^^^^^^^^^^^^^    ^^
Company                      Position                Location
You could split the job title at "is hiring" and "in".
import requests
from bs4 import BeautifulSoup
import re
r=requests.get('https://news.ycombinator.com/jobs')
soup=BeautifulSoup(r.text,'html.parser')
job_titles=list()
for td in soup.findAll('td', {'class': 'title'}):
    job_titles.append(td.text)
split_regex = re.compile(r'\sis hiring\s|\sin\s', re.IGNORECASE)
job_titles_lists = [split_regex.split(title) for title in job_titles]
valid_jobs = [l for l in job_titles_lists if len(l) == 3]
# print the output
for l in valid_jobs:
    for item, value in zip(['Company', 'Position', 'Location'], l):
        print(item + ':' + value)
    print('\n')
Output
Company:Flexport
Position:software engineers
Location:Chicago and San Francisco (flexport.com)
Company:OneSignal
Position:a DevOps Engineer
Location:San Mateo (onesignal.com)
...
Note
Not a perfect solution.
Take permission from the site owner.
I would go with something less specific than Bitto's answer, because if you just look for the regex "is hiring" then you'll miss all the ones that are phrased "is looking" or "is seeking". The general pattern is: [company] is [verb] [position] in [location]. Based on that, you could just look for the indexes of 'is' and 'in' if you split the sentence into a list, and then take the values before 'is', between 'is' and 'in', and after 'in'. Like this:
def split_str(sentence):
    sentence = sentence.lower()
    sentence = sentence.split(' ')
    where_is = sentence.index('is')
    where_in = sentence.index('in')
    name_company = ' '.join(sentence[0:where_is])
    position = ' '.join(sentence[where_is+2:where_in])
    location = ' '.join(sentence[where_in+1:len(sentence)])
    ans = (name_company, position, location)
    test = [True if len(list(x)) != 0 else False for x in ans]
    if False in test:
        return ('None', 'None', 'None')
    else:
        return (name_company, position, location)

# not a valid input because it does not have a position
some_sentence1 = 'Streak CRM for Gmail (YC S11) Is Hiring in Vancouver'
# valid because it has company, position, location
some_sentence = 'Flexport is hiring software engineers in Chicago and San Francisco'

print(split_str(some_sentence))
print(split_str(some_sentence1))
I added a checker that would simply determine if a value were missing and then make the entire thing invalid with ('None', 'None', 'None') or return all of the values.
output:
('flexport', 'software engineers', 'chicago and san francisco')
('None', 'None', 'None')
Just an idea; this will also not be perfect, as '[company] is looking to hire [position] in [location]' would give you back (company, 'to hire [position]', location). You could clean this up by checking out the NLTK module, though, and using it to filter the nouns from everything else, as sketched below.
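A minimal sketch of that NLTK idea, assuming nltk is installed and its punkt and averaged_perceptron_tagger data have been downloaded: POS-tag the sentence and keep the proper nouns, which tends to isolate the company and location names.
# Rough NLTK sketch: keep only proper nouns (NNP tags) from the job title.
# Assumes: pip install nltk, then nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') have been run once.
import nltk

sentence = 'Flexport is hiring software engineers in Chicago and San Francisco'
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
proper_nouns = [word for word, tag in tagged if tag == 'NNP']
print(proper_nouns)  # e.g. ['Flexport', 'Chicago', 'San', 'Francisco']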
I am trying to scrape all the articles on this web page: https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/
I can scrape the first article, but need help understanding how to jump to the next article and scrape the information there. Thank you in advance for your support.
import requests
from bs4 import BeautifulSoup
class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

# Scraping news articles from Coindesk
def scrapeCoindesk(url):
    bs = getPage(url)
    title = bs.find("h3").text
    body = bs.find("p", {'class': 'desc'}).text
    return Content(url, title, body)

# Pulling the article from Coindesk
url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)
You can use the fact that every article is contained within a div.article to iterate over them:
def scrapeCoindesk(url):
    bs = getPage(url)
    articles = []
    for article in bs.find_all("div", {"class": "article"}):
        title = article.find("h3").text
        body = article.find("p", {"class": "desc"}).text
        article_url = article.find("a", {"class": "fade"})["href"]
        articles.append(Content(article_url, title, body))
    return articles

# Pulling the articles from Coindesk
url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
for article in content:
    print(article.url)
    print(article.title)
    print(article.body)
    print("-------------")
You can use find_all with BeautifulSoup:
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import requests, re
article = namedtuple('article', 'title, link, timestamp, author, description')
r = requests.get('https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/').text
full_data = soup(r, 'lxml')
results = [[i.text, i['href']] for i in full_data.find_all('a', {'class':'fade'})]
timestamp = [re.findall(r'(?<=\n)[a-zA-Z\s]+[\d\s,]+at[\s\d:]+', i.text)[0] for i in full_data.find_all('p', {'class':'timeauthor'})]
authors = [i.text for i in full_data.find_all('a', {'rel':'author'})]
descriptions = [i.text for i in full_data.find_all('p', {'class':'desc'})]
full_articles = [article(*(list(i[0])+list(i[1:]))) for i in zip(results, timestamp, authors, descriptions) if i[0][0] != '\n ']
Output:
[article(title='Topping Out? Bitcoin Bulls Need to Defend $9K', link='https://www.coindesk.com/topping-out-bitcoin-bulls-need-to-defend-9k/', timestamp='May 8, 2018 at 09:10 ', author='Omkar Godbole', description='Bitcoin risks falling to levels below $9,000, courtesy of the bearish setup on the technical charts. '), article(title='Bitcoin Risks Drop Below $9K After 4-Day Low', link='https://www.coindesk.com/bitcoin-risks-drop-below-9k-after-4-day-low/', timestamp='May 7, 2018 at 11:00 ', author='Omkar Godbole', description='Bitcoin is reporting losses today but only a break below $8,650 would signal a bull-to-bear trend change. '), article(title="Futures Launch Weighed on Bitcoin's Price, Say Fed Researchers", link='https://www.coindesk.com/federal-reserve-scholars-blame-bitcoins-price-slump-to-the-futures/', timestamp='May 4, 2018 at 09:00 ', author='Wolfie Zhao', description='Cai Wensheng, a Chinese angel investor, says he bought 10,000 BTC after the price dropped earlier this year.\n'), article(title='Bitcoin Looks for Price Support After Failed $10K Crossover', link='https://www.coindesk.com/bitcoin-looks-for-price-support-after-failed-10k-crossover/', timestamp='May 3, 2018 at 10:00 ', author='Omkar Godbole', description='While equity bulls fear drops in May, it should not be a cause of worry for the bitcoin market, according to historical data.'), article(title='Bitcoin Sets Sights Above $10K After Bull Breakout', link='https://www.coindesk.com/bitcoin-sets-sights-10k-bull-breakout/', timestamp='May 3, 2018 at 03:18 ', author='Wolfie Zhao', description="Goldman Sachs is launching a new operation that will use the firm's own money to trade bitcoin-related contracts on behalf of its clients.")]