I know very little about backend programming. I wanted to scrape data from the Delhi Fire Service for my academic project; there are fire reports available online, organised zone-wise for Delhi, and each zone has lots of files.
By the way, if you go directly to this link you get an empty page (I don't know why). If I click on one file, it opens like this.
There is a pattern in the link: each time only the report number changes and the rest of the URL stays the same, so I collected all the links for scraping. The problem I am facing is that when I load a link with requests/BeautifulSoup I do not get the same content for that report as when I load the same link in a browser.
import bs4 as bs
import requests

# read the saved search-results page and collect every report link from it
with open("p.html", 'r') as f:
    page = f.read()

soup = bs.BeautifulSoup(page, 'lxml')
links = soup.find_all('a')
urls = []
for link in links:
    urls.append(link.get('href'))

string1 = "http://delhigovt.nic.in/FireReport/"
# print(urls)
link1 = string1 + urls[1]
print(link1)

sauce = requests.get(link1)
soup = bs.BeautifulSoup(sauce.content, 'lxml')
print(soup)
And it is random: sometimes if I copy a link and load it in a new tab (or another browser) it turns into an error page, so I lose the report information. I am not able to scrape the data this way even though I have the links for all the reports. Can someone tell me what is going on? Thank you.
Update - link: http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public
There you have to select the option "No" in the right corner to be able to get the "Search" button.
To scrape the page, you need to use a requests session (requests.Session) so the cookies are set correctly. There is also a ud parameter in the POST request that the page uses, and it needs to be set correctly.
For example (this scrapes all stations and reports and stores them in the dictionary data):
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'http://delhigovt.nic.in/FireReport/r_publicSearch.asp?user=public'
post_url = 'http://delhigovt.nic.in/FireReport/a_publicSearch.asp'

params = {'ud': '',
          'fstation': '',
          'caller': '',
          'add': '',
          'frmdate': '',
          'todate': '',
          'save': 'Search'}

def open_report(s, url):
    url = 'http://delhigovt.nic.in/FireReport/' + url
    print(url)
    soup = BeautifulSoup(s.get(url).content, 'lxml')
    # just return some text here
    return soup.select('body > table')[1].get_text(strip=True, separator=' ')

data = {}
with requests.session() as s:
    soup = BeautifulSoup(s.get(url).content, 'lxml')

    stations = {}
    for option in soup.select('select[name="fstation"] option[value]:not(:contains("Select Fire Station"))'):
        stations[option.get_text(strip=True)] = option['value']

    params['ud'] = soup.select_one('input[name="ud"][value]')['value']

    for k, v in stations.items():
        print('Scraping station {} id={}'.format(k, v))
        params['fstation'] = int(v)
        soup = BeautifulSoup(s.post(post_url, data=params).content, 'lxml')

        for tr in soup.select('tr:has(> td > a[href^="f_publicReport.asp?rep_no="])'):
            no, fire_report_no, date, address = tr.select('td')
            link = fire_report_no.a['href']
            data.setdefault(k, [])
            data[k].append((no.get_text(strip=True), fire_report_no.get_text(strip=True),
                            date.get_text(strip=True), address.get_text(strip=True),
                            link, open_report(s, link)))
            pprint(data[k][-1])

    # pprint(data)  # <-- here is your data
Prints:
Scraping station Badli id=33
http://delhigovt.nic.in/FireReport/f_publicReport.asp?rep_no=200600024&ud=6668
('1',
'200600024',
'1-Apr-2006',
'Shahbad, Daulat Pur.',
'f_publicReport.asp?rep_no=200600024&ud=6668',
'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
'Number  : 200600024 Operational Jurisdiction of Fire Station : '
'Badli Information Received From: PCR Full Address of Incident Place: '
'Shahbad, Daulat Pur. Date of Receipt of Call : Saturday, April 1, 2006 '
'Time of Receipt of Call \t : 17\xa0Hrs\xa0:\xa055\xa0Min Time of '
'Departure From Fire Station: 17\xa0Hrs\xa0:\xa056\xa0Min Approximate '
'Distance From Fire Station: 3\xa0\xa0Kilometers Time of Arrival at Fire '
'Scene: 17\xa0Hrs\xa0:\xa059\xa0Min Nature of Call Fire Date of Leaving From '
'Fire Scene: 4/1/2006 Time of Leaving From Fire Scene: 18\xa0Hrs\xa0:\xa0'
'30\xa0Min Type of Occupancy: Others Occupancy Details in Case of Others: '
'NDPL Category of Fire: Small Type of Building: Low Rise Details of Affected '
'Area: Fire was in electrical wiring. Divisional Officer Delhi Fire Service '
'Disclaimer: This is a computer generated report.\r\n'
'Neither department nor its associates, information providers or content '
'providers warrant or guarantee the timeliness, sequence, accuracy or '
'completeness of this information.')
http://delhigovt.nic.in/FireReport/f_publicReport.asp?rep_no=200600161&ud=6668
('2',
'200600161',
'5-Apr-2006',
'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi.',
'f_publicReport.asp?rep_no=200600161&ud=6668',
'Current Date:\xa0\xa0\xa0Tuesday, January 7, 2020 Fire Report '
'Number  : 200600161 Operational Jurisdiction of Fire Station : '
'Badli Information Received From: PCR Full Address of Incident Place: '
'Haidarpur towards Mubarak Pur , Outer Ring Road, Near Nullah, Delhi. Date of '
'Receipt of Call : Wednesday, April 5, 2006 Time of Receipt of Call \t'
' : 19\xa0Hrs\xa0:\xa010\xa0Min Time of Departure From Fire Station: '
'19\xa0Hrs\xa0:\xa011\xa0Min Approximate Distance From Fire Station: '
'1.5\xa0\xa0Kilometers Time of Arrival at Fire Scene: 19\xa0Hrs\xa0:\xa013\xa0'
'Min Nature of Call Fire Date of Leaving From Fire Scene: 4/5/2006 Time of '
'Leaving From Fire Scene: 20\xa0Hrs\xa0:\xa050\xa0Min Type of Occupancy: '
'Others Occupancy Details in Case of Others: MCD Category of Fire: Small Type '
'of Building: Others Building Details in Case of Others: On Road Details of '
'Affected Area: Fire was in Rubbish and dry tree on road. Divisional Officer '
'Delhi Fire Service Disclaimer: This is a computer generated report.\r\n'
'Neither department nor its associates, information providers or content '
'providers warrant or guarantee the timeliness, sequence, accuracy or '
'completeness of this information.')
...and so on.
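If you want to keep the scraped reports for later analysis (e.g. for the academic project), one possible follow-up, assuming the data dictionary built by the snippet above, is to dump it to a JSON file:
import json

# purely illustrative: write the `data` dict built above to disk,
# so the reports can be analysed later without re-scraping the site
with open('fire_reports.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)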
Related
I am making a web scraper for Book Depository and I came across a problem with the HTML elements of the site. The page for a book has a section called Product Details and I need to take each element from that list. However, some of the elements (not all), like Language, have the structure shown in the sample image. How is it possible to get this element?
My work in progress is below. Thanks a lot in advance.
import bs4
from urllib.request import urlopen
book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_description = soup.find('div', class_='item-excerpt trunc')
book_title = soup.find('h1').text
book_info = soup.find('ul', class_='biblio-info')
book_pages = book_info.find('span', itemprop='numberOfPages').text
book_ibsn = book_info.find('span', itemprop='isbn').text
book_publication_date = book_info.find('span', itemprop='datePublished').text
book_publisher = book_info.find('span', itemprop='name').text
book_author = soup.find('span', itemprop="author").text
book_cover = soup.find('div', class_='item-img-content').img
book_language = book_info.find_next(string='Language',)
book_format = book_info.find_all(string='Format', )
print('Number of Pages: ' + book_pages.strip())
print('ISBN Number: ' + book_ibsn)
print('Publication Date: ' + book_publication_date)
print('Publisher Name: ' + book_publisher.strip())
print('Author: '+ book_author.strip())
print(book_cover)
print(book_language)
print(book_format)
To get the corresponding <span> to your label you could go with:
book_info.find_next(string='Language').find_next('span').get_text(strip=True)
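For context, here is a minimal, untested sketch of that one-liner inside the question's own setup (same URL and book_info as above):
import bs4
from urllib.request import urlopen

book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/9781399703994"
soup = bs4.BeautifulSoup(urlopen(book_urls).read(), 'lxml')
book_info = soup.find('ul', class_='biblio-info')

# jump from the "Language" label text to the <span> that follows it
book_language = book_info.find_next(string='Language').find_next('span').get_text(strip=True)
print(book_language)  # should print something like "English"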
A more generic approach to get all these product details could be:
import bs4, re
from urllib.request import urlopen

book_isbn = "9781399703994"
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source, 'lxml')

book = {
    'description': soup.find('div', class_='item-excerpt trunc').get_text(strip=True),
    'title': soup.find('h1').text
}
# every <li> in the product-details list has a <label> (name) and a <span> (value)
book.update({e.label.text.strip(): re.sub(r'\s+', ' ', e.span.text).strip()
             for e in soup.select('.biblio-info li')})
book
Output:
{'description': "'A breathtaking memoir...I was so moved by this book.' Oprah'It is startlingly honest and, at times, a jaw-dropping read, charting her rise from poverty and abuse to becoming the first African-American to win the triple crown of an Oscar, Emmy and Tony for acting.' BBC NewsTHE DEEPLY PERSONAL, BRUTALLY HONEST ACCOUNT OF VIOLA'S INSPIRING LIFEIn my book, you will meet a little girl named Viola who ran from her past until she made a life changing decision to stop running forever.This is my story, from a crumbling apartment in Central Falls, Rhode Island, to the stage in New York City, and beyond. This is the path I took to finding my purpose and my strength, but also to finding my voice in a world that didn't always see me.As I wrote Finding Me, my eyes were open to the truth of how our stories are often not given close examination. They are bogarted, reinvented to fit into a crazy, competitive, judgmental world. So I wrote this for anyone who is searching for a way to understand and overcome a complicated past, let go of shame, and find acceptance. For anyone who needs reminding that a life worth living can only be born from radical honesty and the courage to shed facades and be...you.Finding Me is a deep reflection on my past and a promise for my future. My hope is that my story will inspire you to light up your own life with creative expression and rediscover who you were before the world put a label on you.show more",
'title': 'Finding Me : A Memoir - THE INSTANT SUNDAY TIMES BESTSELLER',
'Format': 'Hardback | 304 pages',
'Dimensions': '160 x 238 x 38mm | 520g',
'Publication date': '26 Apr 2022',
'Publisher': 'Hodder & Stoughton',
'Imprint': 'Coronet Books',
'Publication City/Country': 'London, United Kingdom',
'Language': 'English',
'ISBN10': '1399703994',
'ISBN13': '9781399703994',
'Bestsellers rank': '31'}
You can check if the label text is equal to Language and then print the text. I have also added a better approach to parse the product details section in a single iteration.
Check the code given below:-
import bs4
from urllib.request import urlopen
import re

book_isbn = "9781399703994"
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source, 'lxml')

book_info = soup.find('ul', class_='biblio-info')
lis = book_info.find_all('li')

# Check if the label name is Language and then print the span text
for val in lis:
    label = val.find('label')
    if label.text.strip() == 'Language':
        span = val.find('span')
        span_text = span.text.strip()
        print('Language--> ' + span_text)

# A better approach to get all the name and value pairs in the Product details section in a single iteration
for val in lis:
    label = val.find('label')
    span = val.find('span')
    span_text = span.text.strip()
    modified_text = re.sub('\n', ' ', span_text)
    modified_text = re.sub(' +', ' ', modified_text)
    print(label.text.strip() + '--> ' + modified_text)
You can grab the desired data from the details portion using CSS selectors:
import bs4
from urllib.request import urlopen
import re
book_isbn = ("9781399703994")
book_urls = "https://www.bookdepository.com/Enid-Blytons-Christmas-Tales-Enid-Blyton/" + book_isbn
#print(book_urls)
source = urlopen(book_urls).read()
soup = bs4.BeautifulSoup(source,'lxml')
book_description = soup.find('div', class_='item-excerpt trunc')
book_title = soup.find('h1').text
book_info = soup.find('ul', class_='biblio-info')
book_pages = book_info.find('span', itemprop='numberOfPages').text
book_ibsn = book_info.find('span', itemprop='isbn').text
book_publication_date = book_info.find('span', itemprop='datePublished').text
book_publisher = book_info.find('span', itemprop='name').text
book_author = soup.find('span', itemprop="author").text
book_cover = soup.find('div', class_='item-img-content').img.get('src')
book_language =soup.select_one('.biblio-info > li:nth-child(7) span').get_text(strip=True)
book_format = soup.select_one('.biblio-info > li:nth-child(1) span').get_text(strip=True)
book_format = re.sub(r'\s+', ' ',book_format).replace('|','')
print('Number of Pages: ' + book_pages.strip())
print('ISBN Number: ' + book_ibsn)
print('Publication Date: ' + book_publication_date)
print('Publisher Name: ' + book_publisher.strip())
print('Author: '+ book_author.strip())
print(book_cover)
print(book_language)
print(book_format)
Output:
Number of Pages: 304 pages
ISBN Number: 9781399703994
Publication Date: 26 Apr 2022
Publisher Name: Hodder & Stoughton
Author: Viola Davis
https://d1w7fb2mkkr3kw.cloudfront.net/assets/images/book/lrg/9781/3997/9781399703994.jpg
English
Hardback 304 pages
I am looking to use Beautiful Soup to scrape the Fujitsu news update page: https://www.fujitsu.com/uk/news/pr/2020/
I only want to extract the information under the headings of the current month and previous month.
For a particular month (e.g. November), I am trying to extract the title, the URL, and the text for each news briefing into a list (so a list of lists).
My attempt so far is as follows (showing only the previous month for simplicity):
import datetime
import calendar
import requests
from bs4 import BeautifulSoup

today = datetime.datetime.today()
year_str = str(today.year)
current_m = today.month
previous_m = current_m - 1
current_m_str = calendar.month_name[current_m]
previous_m_str = calendar.month_name[previous_m]

URL = 'https://www.fujitsu.com/uk/news/pr/' + year_str + '/'
resp = requests.get(URL)
soup = BeautifulSoup(resp.text, 'lxml')

previous_m_body = soup.find('h3', text=previous_m_str)
if previous_m_body is not None:
    for sib in previous_m_body.find_next_siblings():
        if sib.name == "h3":
            break
        else:
            previous_m_text = str(sib.text)
            print(previous_m_text)
However, this generates one long string with newlines and no separation between the title, text and URL:
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution London, United Kingdom, November 30, 2020 - Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local...
Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive Loads Hoofddorp, EMEA, November 11, 2020 - Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring.......
I have attached an image of the page DOM.
Try this:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.fujitsu.com/uk/news/pr/2020/").text
all_lists = BeautifulSoup(html, "html.parser").find_all("ul", class_="filterlist")

news = []
for unordered_list in all_lists:
    for list_item in unordered_list.find_all("li"):
        news.append(
            [
                list_item.find("a").getText(),
                f"https://www.fujitsu.com{list_item.find('a')['href']}",
                list_item.getText(strip=True)[len(list_item.find("a").getText()):],
            ]
        )

for news_item in news:
    print("\n".join(news_item))
    print("-" * 80)
Output (shortened for brevity):
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution
https://www.fujitsu.com/uk/news/pr/2020/fs-20201130.html
London, United Kingdom, November 30, 2020- Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local Authorities to support the electronic counting (e-Counting) of ballot papers at the Scottish Local Government elections in May 2022.Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive LoadsHoofddorp, EMEA, November 11, 2020- Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring a switching capacity of 50A at 14VDC. The FBR53-HC offers a higher contact rating than its 40A FBR53-HW counterpart, yet occupies the same 12.1 x 15.5 x 13.7mm footprint and weighs the same 6g.
--------------------------------------------------------------------------------
and more ...
EDIT:
To get just the last two months, all you need is the first two ul items from the soup. So, add [:2] to the first for loop, like this:
for unordered_list in all_lists[:2]:
    # the rest of the loop body goes here
Here I modified your code. I combined your bs4 code with Selenium. Selenium is very powerful for scraping dynamic or JavaScript-based websites, and you can use it together with BeautifulSoup to make your life easier. Now it will give you output for all months.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.maximize_window()

url = "https://www.fujitsu.com/uk/news/pr/2020/"  # change the url if you want results for a different year
driver.get(url)

# now your bs4 code starts; it gives you output from the current month back through all previous months
soup = BeautifulSoup(driver.page_source, "html.parser")

# here I am getting all month names, from January to November
months = soup.find_all('h3')
for month in months:
    month = month.text
    print(f"month_name : {month}\n")

# here we are getting all description text from the current month back through all previous months
description_texts = soup.find_all('ul', class_='filterlist')
for description_text in description_texts:
    description_texts = description_text.text.replace('\n', '')
    print(f"description_text: {description_texts}")
I am trying to scrape a website for titles as well as other items but for the sake of brevity, just game titles.
I have tried using Selenium and Beautiful Soup in tandem to grab the titles, but I cannot seem to get all of the September releases no matter what I do. In fact, I get some of the August game titles as well. I think it has to do with the fact that the page has no ending (it keeps loading as you scroll). How would I grab just the September titles? Below is the code I used; I have tried to use scrolling, but I do not think I understand how to use it properly.
EDIT: My goal is to be able to eventually get each month by changing a few lines of code.
from selenium import webdriver
from bs4 import BeautifulSoup

titles = []
chromedriver = 'C:/Users/Chase The Great/Desktop/Podcast/chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
driver.get('https://www.releases.com/l/Games/2019/9/')
res = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

soup = BeautifulSoup(res, 'lxml')
for title in soup.find_all(class_='calendar-item-title'):
    titles.append(title.text)
I expect to get 133 titles, but I get some August titles plus only part of the September titles, as such:
['SubaraCity', 'AER - Memories of Old', 'Vambrace: Cold Soul', 'Agent A: A Puzzle in Disguise', 'Bubsy: Paws on Fire!', 'Grand Brix Shooter', 'Legend of the Skyfish', 'Vambrace: Cold Soul', 'Obakeidoro!', 'Pokemon Masters', 'Decay of Logos', 'The Lord of the Rings: Adventure ...', 'Heave Ho', 'Newt One', 'Blair Witch', 'Bulletstorm: Duke of Switch Edition', 'The Ninja Saviors: Return of the ...', 'Re:Legend', 'Risk of Rain 2', 'Decay of Logos', 'Unlucky Seven', 'The Dark Pictures Anthology: Man ...', 'Legend of the Skyfish', 'Astral Chain', 'Torchlight II', 'Final Fantasy VIII Remastered', 'Catherine: Full Body', 'Root Letter: Last Answer', 'Children of Morta', 'Himno', 'Spyro Reignited Trilogy', 'RemiLore: Lost Girl in the Lands ...', 'Divinity: Original Sin 2 - Defini...', 'Monochrome Order', 'Throne Quest Deluxe', 'Super Kirby Clash', 'Himno', 'Post War Dreams', 'The Long Journey Home', 'Spice and Wolf VR', 'WRC 8', 'Fantasy General II', 'River City Girls', 'Headliner: NoviNews', 'Green Hell', 'Hyperforma', 'Atomicrops', 'Remothered: Tormented Fathers']
It seems to me that in order to get only September, first you want to grab only the section for September:
section = soup.find('section', {'class': 'Y2019-M9 calendar-sections'})
Then, once you fetch the section for September, get all the titles, which are in an <a> tag, like this:
for title in section.find_all('a', {'class': 'calendar-item-title subpage-trigg'}):
    titles.append(title.text)
Please note that none of the previous has been tested.
UPDATE:
The problem is that every time you load the page, it gives you only the very first section, which contains only 24 items; to access the rest you have to scroll down (infinite scroll).
If you open the browser developer tools, select Network and then XHR, you will notice that every time you scroll and load the next "page" there is a request with a URL similar to this:
https://www.releases.com/calendar/nextAfter?blockIndex=139&itemIndex=23&category=Games&regionId=us
My guess is that blockIndex is meant for the month and itemIndex is for every page loaded. If you are looking only for the month of September, blockIndex will always be 139 in that request; the challenge is to get the next itemIndex for the next page so you can construct the next request.
The next itemIndex will always be the last itemIndex of the previous request.
I did make a script that does what you want using only BeautifulSoup. Use it at your own discretion; there are some constants that could be extracted dynamically, but I think this should give you a head start:
import json
import requests
from bs4 import BeautifulSoup

DATE_CODE = 'Y2019-M9'
LAST_ITEM_FIRST_PAGE = f'calendar-item col-xs-6 to-append first-item calendar-last-item {DATE_CODE}-None'
LAST_ITEM_PAGES = f'calendar-item col-xs-6 to-append calendar-last-item {DATE_CODE}-None'
INITIAL_LINK = 'https://www.releases.com/l/Games/2019/9/'
BLOCK = 139

titles = []

def get_next_page_link(div: BeautifulSoup):
    index = div['item-index']
    return f'https://www.releases.com/calendar/nextAfter?blockIndex={BLOCK}&itemIndex={index}&category=Games&regionId=us'

def get_content_from_requests(page_link):
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    req = requests.get(page_link, headers=headers)
    return BeautifulSoup(req.content, 'html.parser')

def scroll_pages(link: str):
    print(link)
    page = get_content_from_requests(link)
    for div in page.findAll('div', {'date-code': DATE_CODE}):
        item = div.find('a', {'class': 'calendar-item-title subpage-trigg'})
        if item:
            # print(f'TITLE: {item.getText()}')
            titles.append(item.getText())
    last_index_div = page.find('div', {'class': LAST_ITEM_FIRST_PAGE})
    if not last_index_div:
        last_index_div = page.find('div', {'class': LAST_ITEM_PAGES})
    if last_index_div:
        scroll_pages(get_next_page_link(last_index_div))
    else:
        print(f'Found: {len(titles)} Titles')
        print('No more pages to scroll, finishing...')

scroll_pages(INITIAL_LINK)

with open('titles.json', 'w') as outfile:
    json.dump(titles, outfile)
If your goal is to use Selenium, I think the same principle applies, unless you use its scrolling capability to load the page as it goes.
Replacing INITIAL_LINK, DATE_CODE and BLOCK accordingly will get you other months as well.
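If you do go the Selenium route, here is a rough, untested sketch of the scrolling idea, reusing the chromedriver path and URL from the question; you would still need to filter the result down to the September (Y2019-M9) section as shown above:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('C:/Users/Chase The Great/Desktop/Podcast/chromedriver.exe')
driver.get('https://www.releases.com/l/Games/2019/9/')

# keep scrolling to the bottom until the page height stops growing,
# i.e. the infinite scroll has no more blocks to load
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the next block of items time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# restrict to the September section, then collect the titles
section = soup.find('section', {'class': 'Y2019-M9 calendar-sections'})
titles = [a.text for a in section.find_all('a', {'class': 'calendar-item-title subpage-trigg'})]
print(len(titles))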
I am scraping this website. I have a script which scrapes the sentences that contain the relevant information.
Now what I want to do is extract the following information from each scraped sentence:
The name of the company that is hiring
The location of the company
The position that the ad is for
Job listings which do not have all three required fields will be discarded.
This is my script:
from bs4 import BeautifulSoup
import requests

# scrape the given website
url = "https://news.ycombinator.com/jobs"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
table = content.find("table", attrs={"class": "itemlist"})

array = []
# now store the required data in an array
for elem in table.findAll('tr', attrs={'class': 'athing'}):
    array.append({'id': elem.get('id'),
                  'listing': elem.find('a', attrs={'class': 'storylink'}).text})
Most of the jobs seem to have the following pattern:
ZeroCater (YC W11) Is Hiring a Principal Engineer in SF
i.e. the company ("ZeroCater (YC W11)"), then "Is Hiring", then the position ("a Principal Engineer"), then "in", then the location ("SF").
You could split the job title at "is hiring" and "in".
import requests
from bs4 import BeautifulSoup
import re

r = requests.get('https://news.ycombinator.com/jobs')
soup = BeautifulSoup(r.text, 'html.parser')

job_titles = list()
for td in soup.findAll('td', {'class': 'title'}):
    job_titles.append(td.text)

# split each title on " is hiring " or " in "
split_regex = re.compile(r'\sis hiring\s|\sin\s', re.IGNORECASE)
job_titles_lists = [split_regex.split(title) for title in job_titles]

# keep only the titles that split into exactly company / position / location
valid_jobs = [l for l in job_titles_lists if len(l) == 3]

# print the output
for l in valid_jobs:
    for item, value in zip(['Company', 'Position', 'Location'], l):
        print(item + ':' + value)
    print('\n')
Output
Company:Flexport
Position:software engineers
Location:Chicago and San Francisco (flexport.com)
Company:OneSignal
Position:a DevOps Engineer
Location:San Mateo (onesignal.com)
...
Note
Not a perfect solution.
Take permission from the site owner.
I would go with something less specific than Bitto's answer, because if you just look for the regex "is hiring" you'll miss all the postings phrased "is looking" or "is seeking". The general pattern is: [company] is [verb] [position] in [location]. Based on that, you could split the sentence into a list, find the indexes of 'is' and 'in', and then take the values before 'is', between 'is' and 'in', and after 'in'. Like this:
def split_str(sentence):
    sentence = sentence.lower()
    sentence = sentence.split(' ')
    where_is = sentence.index('is')
    where_in = sentence.index('in')
    name_company = ' '.join(sentence[0:where_is])
    position = ' '.join(sentence[where_is+2:where_in])
    location = ' '.join(sentence[where_in+1:len(sentence)])
    ans = (name_company, position, location)
    test = [True if len(list(x)) != 0 else False for x in ans]
    if False in test:
        return ('None', 'None', 'None')
    else:
        return (name_company, position, location)

# not a valid input because it does not have a position
some_sentence1 = 'Streak CRM for Gmail (YC S11) Is Hiring in Vancouver'
# valid because it has company, position, location
some_sentence = 'Flexport is hiring software engineers in Chicago and San Francisco'

print(split_str(some_sentence))
print(split_str(some_sentence1))
I added a checker that would simply determine if a value were missing and then make the entire thing invalid with ('None', 'None', 'None') or return all of the values.
output:
('flexport', 'software engineers', 'chicago and san francisco')
('None', 'None', 'None')
Just an idea; this will also not be perfect, as '[company] is looking to hire [position] in [location]' would give you back (company, 'to hire [position]', location). You could clean this up by checking out the NLTK module and using it to filter the nouns from everything else.
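A rough, untested sketch of that NLTK idea (it needs the punkt and averaged_perceptron_tagger data downloaded first): part-of-speech tag the words so the split can lean on grammar instead of exact wording:
import nltk

# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # first run only
sentence = 'Flexport is hiring software engineers in Chicago and San Francisco'
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# tagged looks like [('Flexport', 'NNP'), ('is', 'VBZ'), ('hiring', 'VBG'), ...]

# for example, treat everything before the first verb as the company name
company = []
for word, tag in tagged:
    if tag.startswith('VB'):
        break
    company.append(word)
print(' '.join(company))  # 'Flexport'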
I am trying to scrape all the articles on this web page: https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/
I can scrape the first article, but need help understanding how to jump to the next article and scrape the information there. Thank you in advance for your support.
import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

# Scraping news articles from Coindesk
def scrapeCoindesk(url):
    bs = getPage(url)
    title = bs.find("h3").text
    body = bs.find("p", {'class': 'desc'}).text
    return Content(url, title, body)
# Pulling the article from coindesk
url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
print ('Title:{}'.format(content.title))
print ('URl: {}\n'.format(content.url))
print (content.body)
You can use the fact that every article is contained within a div.article to iterate over them:
def scrapeCoindesk(url):
    bs = getPage(url)
    articles = []
    for article in bs.find_all("div", {"class": "article"}):
        title = article.find("h3").text
        body = article.find("p", {"class": "desc"}).text
        article_url = article.find("a", {"class": "fade"})["href"]
        articles.append(Content(article_url, title, body))
    return articles

# Pulling the article from coindesk
url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
for article in content:
    print(article.url)
    print(article.title)
    print(article.body)
    print("-------------")
You can use find_all with BeautifulSoup:
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import requests, re

article = namedtuple('article', 'title, link, timestamp, author, description')
r = requests.get('https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/').text
full_data = soup(r, 'lxml')

results = [[i.text, i['href']] for i in full_data.find_all('a', {'class': 'fade'})]
timestamp = [re.findall(r'(?<=\n)[a-zA-Z\s]+[\d\s,]+at[\s\d:]+', i.text)[0] for i in full_data.find_all('p', {'class': 'timeauthor'})]
authors = [i.text for i in full_data.find_all('a', {'rel': 'author'})]
descriptions = [i.text for i in full_data.find_all('p', {'class': 'desc'})]

full_articles = [article(*(list(i[0]) + list(i[1:]))) for i in zip(results, timestamp, authors, descriptions) if i[0][0] != '\n ']
Output:
[article(title='Topping Out? Bitcoin Bulls Need to Defend $9K', link='https://www.coindesk.com/topping-out-bitcoin-bulls-need-to-defend-9k/', timestamp='May 8, 2018 at 09:10 ', author='Omkar Godbole', description='Bitcoin risks falling to levels below $9,000, courtesy of the bearish setup on the technical charts. '), article(title='Bitcoin Risks Drop Below $9K After 4-Day Low', link='https://www.coindesk.com/bitcoin-risks-drop-below-9k-after-4-day-low/', timestamp='May 7, 2018 at 11:00 ', author='Omkar Godbole', description='Bitcoin is reporting losses today but only a break below $8,650 would signal a bull-to-bear trend change. '), article(title="Futures Launch Weighed on Bitcoin's Price, Say Fed Researchers", link='https://www.coindesk.com/federal-reserve-scholars-blame-bitcoins-price-slump-to-the-futures/', timestamp='May 4, 2018 at 09:00 ', author='Wolfie Zhao', description='Cai Wensheng, a Chinese angel investor, says he bought 10,000 BTC after the price dropped earlier this year.\n'), article(title='Bitcoin Looks for Price Support After Failed $10K Crossover', link='https://www.coindesk.com/bitcoin-looks-for-price-support-after-failed-10k-crossover/', timestamp='May 3, 2018 at 10:00 ', author='Omkar Godbole', description='While equity bulls fear drops in May, it should not be a cause of worry for the bitcoin market, according to historical data.'), article(title='Bitcoin Sets Sights Above $10K After Bull Breakout', link='https://www.coindesk.com/bitcoin-sets-sights-10k-bull-breakout/', timestamp='May 3, 2018 at 03:18 ', author='Wolfie Zhao', description="Goldman Sachs is launching a new operation that will use the firm's own money to trade bitcoin-related contracts on behalf of its clients.")]