I want to extract the date and summary of an article from a website. Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver
full_url = 'https://www.wsj.com/articles/readers-favorite-summer-recipes-11599238648?mod=searchresults&page=1&pos=20'
url0 = full_url
browser0 = webdriver.Chrome('C:/Users/liuzh/Downloads/chromedriver_win32/chromedriver')
browser0.get(url0)
html0 = browser0.page_source
page_soup = BeautifulSoup(html0, 'html5lib')
date = page_soup.find_all("time", class_="timestamp article__timestamp flexbox__flex--1")
sub_head = page_soup.find_all("h2", class_="sub-head")
print(date)
print(sub_head)
I got the following result. How can I obtain just the text in its standard form (e.g. Sept. 4, 2020 12:57 pm ET; This Labor Day weekend, we’re...)?
[<time class="timestamp article__timestamp flexbox__flex--1">
Sept. 4, 2020 12:57 pm ET
</time>]
[<h2 class="sub-head" itemprop="description">This Labor Day weekend, we’re savoring the last of summer with a collection of seasonal recipes shared by Wall Street Journal readers. Each one comes with a story about what this food means to a family and why they return to it each year.</h2>]
Thanks.
Try something like:
for d in date:
    print(d.text.strip())
Given your sample html, output should be:
Sept. 4, 2020 12:57 pm ET
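The same idea covers the summary as well. A minimal sketch, assuming the markup matches the sample output in the question (the HTML string below is copied from it and shortened, and page_soup is rebuilt from that string rather than from the live page):

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question's output (summary shortened).
html = """
<time class="timestamp article__timestamp flexbox__flex--1">
          Sept. 4, 2020 12:57 pm ET
</time>
<h2 class="sub-head" itemprop="description">This Labor Day weekend, we're savoring the last of summer.</h2>
"""

page_soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None); get_text(strip=True)
# drops the surrounding whitespace and newlines.
date_tag = page_soup.find("time", class_="timestamp")
summary_tag = page_soup.find("h2", class_="sub-head")

date_text = date_tag.get_text(strip=True) if date_tag else None
summary_text = summary_tag.get_text(strip=True) if summary_tag else None

print(date_text)
print(summary_text)
```

Using find() instead of find_all() avoids the list brackets in the printed result when only one element is expected.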
I want to crawl this website http://www.truellikon.ch/freizeit-kultur/anlaesse-agenda.html .
I want to extract date and time of each event.
You can see that date is listed above events. In order to extract date and time I need to combine different divs, but the problem is that I do not have 'container' for group of events that are on the same date.
So the only thing that I can do is to extract all events that are between two divs that refer to date.
This is the code for extracting the event info:
from bs4 import BeautifulSoup
import requests
domain = 'truellikon.ch'
url = 'http://www.truellikon.ch/freizeit-kultur/anlaesse-agenda.html'
def get_website_news_links_truellikonCh():
    response = requests.get(url, allow_redirects=True)
    print("Response for", url, response)
    soup = BeautifulSoup(response.content, 'html.parser')
    all_events = soup.select('div.eventItem')
    for i in all_events:
        print(i)
        print()
        input()
x = get_website_news_links_truellikonCh()
The class name for the date is 'listThumbnailMonthName'.
My question is: how can I combine these divs? How should I write the selectors so that I get the exact date and time, title, and body of each event?
You have one parent container, #tx_nezzoagenda_list, and you have to read its children one by one:
import re
from bs4 import BeautifulSoup
import requests
url = 'http://www.truellikon.ch/freizeit-kultur/anlaesse-agenda.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.select_one('#tx_nezzoagenda_list')
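for child in container.children:
    # Skip text nodes (whitespace between the divs); only tags have a name.
    if not child.name:
        continue
    if 'listThumbnailMonthName' in child.get('class', []):
        # Month header, e.g. "Dezember 2021"; it applies to the events that follow.
        base_date = child.text.strip()
    else:
        day = child.select_one('.dateDayNumber').text.strip()
        title = child.select_one('.titleText').text.strip()
        locationDate = child.select_one('.locationDateText').children
        # The time is the last child of .locationDateText (expected to be a text node).
        time = list(locationDate)[-1].strip()
        time = re.sub(r'\s', '', time)
        print(title, day, base_date, time)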
which outputs
Abendunterhaltung TV Trüllikon 10 Dezember 2021 19:00Uhr-3:00Uhr
Christbaum-Verkauf 18 Dezember 2021 9:30Uhr-11:00Uhr
Silvester Party 31 Dezember 2021 22:00Uhr
Neujahrsapéro 02 Januar 2022 16:00Uhr-18:00Uhr
Senioren-Zmittag 21 Januar 2022 12:00Uhr-15:00Uhr
Theatergruppe "Nume Hüür", Aufführung 23 Januar 2022 13:00Uhr-16:00Uhr
Elektroschrottsammlung 29 Januar 2022 9:00Uhr-12:00Uhr
Senioren Z'mittag 18 Februar 2022 12:00Uhr-15:00Uhr
Frühlingskonzert 10 April 2022 12:17Uhr
Weinländer Musiktag 22 Mai 2022 8:00Uhr
Auffahrtskonzert Altersheim 26 Mai 2022 10:30Uhr
Feierabendmusik und Jubilarenehrung 01 Juli 2022 19:00Uhr
Feierabendmusik 15 Juli 2022 12:24Uhr
Feierabendmusik 19 August 2022 19:00Uhr
Herbstanlass 19 November 2022 20:00Uhr
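If real datetime objects are more useful than strings, the pieces printed above can be combined. A minimal sketch, assuming the values look like the output shown (day "10", base_date "Dezember 2021", time "19:00Uhr-3:00Uhr"). The helper name parse_event_start and the month table are illustrative and not part of the original answer:

```python
import re
from datetime import datetime

# German month names; only some of them appear in the output above.
GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10,
    "November": 11, "Dezember": 12,
}

def parse_event_start(day, base_date, time_text):
    """Combine day, 'Monat Jahr' and 'HH:MMUhr[-HH:MMUhr]' into a datetime."""
    month_name, year = base_date.split()
    # Take the first HH:MM in the string; the end time (if any) is ignored.
    match = re.search(r"(\d{1,2}):(\d{2})", time_text)
    hour, minute = (int(match.group(1)), int(match.group(2))) if match else (0, 0)
    return datetime(int(year), GERMAN_MONTHS[month_name], int(day), hour, minute)

start = parse_event_start("10", "Dezember 2021", "19:00Uhr-3:00Uhr")
print(start.isoformat())
```

Only the start time is parsed here; an end time such as 3:00Uhr falls on the following day and would need separate handling.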
I'm trying to scrape data from this webpage, and all 900+ pages that follow: https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True
It's important that the scraper does not target the pagination link, but rather iterates through the "page=" number in the url. This is because the data present is loaded dynamically in the original webpage, which the pagination links point back to.
I've tried writing something that loops through the page numbers in the url, via the "last" class of the pagination ul, to find the final page, but I am not sure how to target the specific part of the url, whilst keeping the search query the same for each result
r = requests.get(url_pagination)
soup = BeautifulSoup(r.content, "html.parser")
page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}" + "&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"
last_page = soup.find('ul', class_='pagination').find('li', class_='last').a['href'].split('=')[1]
dept_page_url = [page_url.format(i) for i in range(1, int(last_page)+1)]
print(dept_page_url)
I would ideally like to scrape just the name from class "secondaryTitle", and the 2nd unnamed div that contains the date, per row.
I keep getting an error: ValueError: invalid literal for int() with base 10: '2019-07-11&searchTerm'
You could try this script, but beware, it goes from page 1 all the way to last page 966:
import requests
from bs4 import BeautifulSoup
next_page_url = 'https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True'
# this goes to page '966'
while True:
    print('Scrapping {} ...'.format(next_page_url))
    r = requests.get(next_page_url)
    soup = BeautifulSoup(r.content, "html.parser")
    for secondary_title, date in zip(soup.select('.secondaryTitle'), soup.select('.secondaryTitle + *')):
        print('{: >20} - {}'.format(date.get_text(strip=True), secondary_title.get_text(strip=True)))
    next_link = soup.select_one('a:has(span:contains(Next))')
    if next_link:
        next_page_url = 'https://hansard.parliament.uk' + next_link['href'] + '&partial=True'
    else:
        break
Prints:
Scrapping https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True ...
17 January 2007 - Ian Pearson
21 December 2017 - Baroness Vere of Norbiton
2 May 2019 - Lord Parekh
4 February 2013 - Baroness Hanham
21 December 2017 - Baroness Walmsley
9 February 2010 - Colin Challen
6 February 2002 - Baroness Farrington of Ribbleton
24 April 2007 - Barry Gardiner
17 January 2007 - Rob Marris
7 March 2002 - The Parliamentary Under-Secretary of State, Department for Environment, Food and Rural Affairs (Lord Whitty)
27 October 1999 - Mr. Tom Brake (Carshalton and Wallington)
9 February 2004 - Baroness Miller of Chilthorne Domer
7 March 2002 - The Secretary of State for Environment, Food and Rural Affairs (Margaret Beckett)
27 February 2007 -
8 October 2008 - Baroness Andrews
24 March 2011 - Lord Henley
21 December 2017 - Lord Krebs
21 December 2017 - Baroness Young of Old Scone
16 June 2009 - Mark Lazarowicz
14 July 2006 - Lord Rooker
Scrapping https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&searchTerm=%22climate+change%22&startDate=1800-01-01&page=2&partial=True ...
12 October 2006 - Lord Barker of Battle
29 January 2009 - Lord Giddens
... and so on.
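Your error occurs because you are taking the wrong element from your split. Index 1 is the text between the first and second '=', whereas the page number follows the last '=', so you want index -1. Observe: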
last_page = soup.find('ul', class_='pagination').find('li', class_='last').a['href']
print(last_page)
print(last_page.split('=')[1])
print(last_page.split('=')[-1])
Gives:
/search/Contributions?endDate=2019-07-11&searchTerm=%22climate+change%22&startDate=1800-01-01&page=966
Splitting and taking index 1 gives:
2019-07-11&searchTerm
whereas index -1 gives:
966
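Splitting on '=' works only while page happens to be the last parameter in the href. A slightly more robust sketch, using the href printed above as sample input, reads the parameter by name with the standard library:

```python
from urllib.parse import urlparse, parse_qs

# The href of the "last" pagination link, as printed above.
href = ("/search/Contributions?endDate=2019-07-11"
        "&searchTerm=%22climate+change%22&startDate=1800-01-01&page=966")

# parse_qs maps each query parameter to a list of its (decoded) values.
params = parse_qs(urlparse(href).query)
last_page = int(params["page"][0])

print(last_page)
print(params["searchTerm"][0])
```

This keeps working if the site reorders the query parameters.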
To get the info you want from each page, I would do pretty much what the other answer does in terms of CSS selectors and zipping. Below are some other looping constructs, using a Session for efficiency given the number of requests.
You could make an initial request and extract the number of pages then loop for those. Use Session object for efficiency of connection re-use.
import requests
from bs4 import BeautifulSoup as bs
def make_soup(s, page):
    page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"
    r = s.get(page_url.format(page))
    soup = bs(r.content, 'lxml')
    return soup
with requests.Session() as s:
    soup = make_soup(s, 1)
    pages = int(soup.select_one('.last a')['href'].split('page=')[1])
    for page in range(2, pages + 1):
        soup = make_soup(s, page)
        #do something with soup
Alternatively, you could loop until the class last ceases to appear:
import requests
from bs4 import BeautifulSoup as bs
present = True
page = 1
#results = {}
def make_soup(s, page):
    page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"
    r = s.get(page_url.format(page))
    soup = bs(r.content, 'lxml')
    return soup
with requests.Session() as s:
    while present:
        soup = make_soup(s, page)
        present = len(soup.select('.last')) > 0
        #results[page] = soup.select_one('.pagination-total').text
        #extract info
        page+=1
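Both loops hard-code the query string inside page_url. If the search terms need to change, one option is to build each URL from a dict of parameters. A minimal sketch using only the standard library; build_page_url is an illustrative helper name, and the parameter values are taken from the URL in the question:

```python
from urllib.parse import urlencode

BASE = "https://hansard.parliament.uk/search/Contributions"

def build_page_url(page, search_term='"climate change"',
                   start_date="1800-01-01", end_date="2019-07-11"):
    """Return the search URL for one results page, keeping the query fixed."""
    params = {
        "endDate": end_date,
        "page": page,
        "searchTerm": search_term,
        "startDate": start_date,
        "partial": "True",
    }
    # urlencode percent-encodes the quotes and turns the space into '+'.
    return BASE + "?" + urlencode(params)

urls = [build_page_url(page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Only the page value changes between the generated URLs, so the search query stays the same for every request.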
I am trying to scrape the website of The New York Times. My code runs without errors (it finishes with exit code 0), but it prints no results.
import time
import requests
from bs4 import BeautifulSoup
url = 'https://www.nytimes.com/search?endDate=20190331&query=cybersecurity&sort=newest&startDate=20180401={}'
pages = [0]
for page in pages:
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("#search-results li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        date = sauce.select(".css-1vkm6nb ehdk2mb0 h1")
        date = date.text
        print(date)
        time.sleep(3)
With this code, I am hoping to get the publish date of each article.
Nice attempt, you're pretty close. The problem is the selectors:
- #search-results asks for an id that doesn't exist. The element is an <ol data-testid="search-results">, so we need another way to reach the anchor tags.
- .css-1vkm6nb ehdk2mb0 h1 doesn't match anything. It asks for an h1 element inside an ehdk2mb0 element, which is itself inside an element with the class css-1vkm6nb. What's actually on the page is an <h1 class="css-1vkm6nb ehdk2mb0"> element, which you can select with h1.css-1vkm6nb.ehdk2mb0.
Having said that, this is not the time data you're after; it's the title. We can get the time element (<time>) with a simple sauce.find("time").
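The difference between the two selectors can be checked offline. A small sketch, assuming markup shaped like the elements described above (the HTML string is made up for illustration; the class names are the ones quoted in this answer):

```python
from bs4 import BeautifulSoup

# Illustrative markup: an <h1> carrying both classes, plus a <time> element.
html = """
<article>
  <h1 class="css-1vkm6nb ehdk2mb0">Example headline</h1>
  <time datetime="2019-03-30">March 30, 2019</time>
</article>
"""
sauce = BeautifulSoup(html, "html.parser")

# Descendant selector: looks for <h1> inside <ehdk2mb0> inside .css-1vkm6nb.
wrong = sauce.select(".css-1vkm6nb ehdk2mb0 h1")

# Compound selector: one <h1> element that has both classes.
title = sauce.select_one("h1.css-1vkm6nb.ehdk2mb0")

# find() with a tag name returns the first <time> element.
published = sauce.find("time")

print(wrong)
print(title.get_text(strip=True))
print(published.get_text(strip=True))
```

Note that select() returns a list, so calling .text on its result (as in the question) would raise an AttributeError; select_one() returns a single element or None.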
Full example:
import requests
from bs4 import BeautifulSoup
base = "https://www.nytimes.com"
url = "https://www.nytimes.com/search?endDate=20190331&query=cybersecurity&sort=newest&startDate=20180401={}"
pages = [0]
for page in pages:
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for link in soup.select(".css-138we14 a"):
        resp = requests.get(base + link.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        title = sauce.select_one("h1.css-1j5ig2m.e1h9rw200")
        time = sauce.find("time")
        print(time.text, title.text.encode("utf-8"))
Output:
March 30, 2019 b'Bezos\xe2\x80\x99 Security Consultant Accuses Saudis of Hacking the Amazon C.E.O.\xe2\x80\x99s Phone'
March 29, 2019 b'In Ukraine, Russia Tests a New Facebook Tactic in Election Tampering'
March 28, 2019 b'Huawei Shrugs Off U.S. Clampdown With a $100 Billion Year'
March 28, 2019 b'N.S.A. Contractor Arrested in Biggest Breach of U.S. Secrets Pleads Guilty'
March 28, 2019 b'Grindr Is Owned by a Chinese Firm, and the U.S. Is Trying to Force It to Sell'
March 28, 2019 b'DealBook Briefing: Saudi Arabia Wanted Cash. Aramco Just Obliged.'
March 28, 2019 b'Huawei Security \xe2\x80\x98Defects\xe2\x80\x99 Are Found by British Authorities'
March 25, 2019 b'As Special Counsel, Mueller Kept Such a Low Profile He Seemed Almost Invisible'
March 21, 2019 b'Quotation of the Day: In New Age of Digital Warfare, Spies for Any Nation\xe2\x80\x99s Budget'
March 21, 2019 b'Coast Guard\xe2\x80\x99s Top Officer Pledges \xe2\x80\x98Dedicated Campaign\xe2\x80\x99 to Improve Diversity'
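The b'...' prefix and the \xe2\x80\x99 escapes appear because the title is encoded to bytes before printing. In Python 3, printing the str directly (or decoding the bytes) gives readable text. A short sketch using one line of the output above as sample data; parsing the date with strptime assumes the default English month names:

```python
from datetime import datetime

# One title exactly as it appears in the output above (UTF-8 encoded bytes).
raw_title = (b"Bezos\xe2\x80\x99 Security Consultant Accuses Saudis of "
             b"Hacking the Amazon C.E.O.\xe2\x80\x99s Phone")

# Decoding restores the right single quotation mark (U+2019).
title = raw_title.decode("utf-8")

# "%B %d, %Y" matches strings such as "March 30, 2019".
published = datetime.strptime("March 30, 2019", "%B %d, %Y").date()

print(published.isoformat(), title)
```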
I am trying to collect the event date, time and venue. They came out successfully but then it is not reader friendly. How do I get the date, time and venue to appear separately like:
- event
Date:
Time:
Venue:
- event
Date:
Time:
Venue:
I was thinking of splitting, but I ended up with lots of [ ], which made it look even uglier. I also tried stripping with a regular expression, but it does not appear to do anything. Any suggestions?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = urllib.request.urlopen(url_toscrape)
info_type = response.info()
responseData = response.read()
soup = BeautifulSoup(responseData, 'lxml')
events_absFirst = soup.find_all("div",{"class": "ntu_event_summary_title_first"})
date_absAll = tr.find_all("div",{"class": "ntu_event_summary_date"})
events_absAll = tr.find_all("div",{"class": "ntu_event_summary_title"})
for first in events_absFirst:
    print('-',first.text.strip())
    print (' ',date)
for tr in soup.find_all("div",{"class":"ntu_event_detail"}):
    date_absAll = tr.find_all("div",{"class": "ntu_event_summary_date"})
    events_absAll = tr.find_all("div",{"class": "ntu_event_summary_title"})
    for events in events_absAll:
        events = events.text.strip()
    for date in date_absAll:
        date = date.text.strip('^Time.*')
    print ('-',events)
    print (' ',date)
You can iterate over the divs containing the event information, store the results, and then print each:
import requests, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.ntu.edu.sg/events/Pages/default.aspx').text, 'html.parser')
results = [[getattr(i.find('div', {'class':re.compile('ntu_event_summary_title_first|ntu_event_summary_title')}), 'text', 'N/A'), getattr(i.find('div', {'class':'ntu_event_summary_detail'}), 'text', 'N/A')] for i in d.find_all('div', {'class':'ntu_event_articles'})]
new_results = [[a, re.findall('Date : .*?(?=\sTime)|Time : .*?(?=Venue)|Time : .*?(?=$)|Venue: [\w\W]+', b)] for a, b in results]
print('\n\n'.join('-{}\n{}'.format(a, '\n'.join(f' {h}:{i}' for h, i in zip(['Date', 'Time', 'Venue'], b))) for a, b in new_results))
Output:
-7th ASEF Rectors' Conference and Students' Forum (ARC7)
Date:Date : 29 Nov 2018 to 14 May 2019
Time:Time : 9:00am to 5:00pm
-Be a Youth Corps Leader
Date:Date : 1 Dec 2018 to 31 Mar 2019
Time:Time : 9:00am to 5:00pm
-NIE Visiting Artist Programme January 2019
Date:Date : 14 Jan 2019 to 11 Apr 2019
Time:Time : 9:00am to 8:00pm
Venue:Venue: NIE Art gallery
-Exercise Classes for You: Healthy Campus#NTU
Date:Date : 21 Jan 2019 to 18 Apr 2019
Time:Time : 6:00pm to 7:00pm
Venue:Venue: The Wave # Sports & Recreation Centre
-[eLearning Course] Information & Media Literacy (From January 2019)
Date:Date : 23 Jan 2019 to 31 May 2019
Time:Time : 9:00am to 5:00pm
Venue:Venue: NTULearn
...
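The repeated labels in the output above (Date:Date : ...) come from the matched text already containing its own label. A sketch of one way to avoid that, assuming each detail string has the shape "Date : ... Time : ... Venue: ..." with Time and Venue optional. The sample strings are modelled on the output above, and split_details and format_event are illustrative names:

```python
import re

# Date is required; Time and Venue are optional and may follow without a space.
DETAIL_RE = re.compile(
    r"Date\s*:\s*(?P<date>.*?)\s*"
    r"(?:Time\s*:\s*(?P<time>.*?)\s*)?"
    r"(?:Venue\s*:\s*(?P<venue>.*?)\s*)?$",
    re.S,
)

def split_details(text):
    """Return a dict with 'date', 'time' and 'venue' ('N/A' when missing)."""
    match = DETAIL_RE.search(text.strip())
    if not match:
        return {"date": "N/A", "time": "N/A", "venue": "N/A"}
    return {key: value or "N/A" for key, value in match.groupdict().items()}

def format_event(title, text):
    """Format one event in the layout asked for in the question."""
    d = split_details(text)
    return "- {}\n  Date: {}\n  Time: {}\n  Venue: {}".format(
        title.strip(), d["date"], d["time"], d["venue"])

print(format_event(
    "NIE Visiting Artist Programme January 2019",
    "Date : 14 Jan 2019 to 11 Apr 2019 Time : 9:00am to 8:00pmVenue: NIE Art gallery",
))
print(format_event(
    "Be a Youth Corps Leader",
    "Date : 1 Dec 2018 to 31 Mar 2019 Time : 9:00am to 5:00pm",
))
```

Named groups keep the label out of the captured value, so each field can be printed under its own heading.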
You could use requests and test the length of stripped_strings
import requests
from bs4 import BeautifulSoup
import pandas as pd
url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = requests.get(url_toscrape)
soup = BeautifulSoup(response.content, 'lxml')
events = [item.text for item in soup.select("[class^='ntu_event_summary_title']")]
data = soup.select('.ntu_event_summary_date')
dates = []
times = []
venues = []
for item in data:
    strings = [string for string in item.stripped_strings]
    if len(strings) == 3:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append(strings[2])
    elif len(strings) == 2:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append('N/A')
    elif len(strings) == 1:
        dates.append(strings[0])
        times.append('N/A')
        venues.append('N/A')
results = list(zip(events, dates, times, venues))
df = pd.DataFrame(results)
print(df)
I am trying to parse an ESPN webpage to get the date, time, and teams playing in each NFL game for a given week using BeautifulSoup. I am able to get most of the information, however, I am having trouble with the time information.
For some reason, the text between the a tag is not being returned.
The html for one of the a tags is:
<a data-dateformat="time1" name="&lpos=nfl:schedule:time" href="/nfl/game?gameId=400874572">12:00 PM</a>
I am looking to get the "12:00 PM" in between the a tags, but instead I get:
<a data-dateformat="time1" href="/nfl/game?gameId=400874572" name="&lpos=nfl:schedule:time"></a>
which doesn't have any text in between the tags.
Here is what I have used to parse the webpage.
import urllib2
from bs4 import BeautifulSoup
def parse_nfl_schedule_espn():
    schedule = BeautifulSoup(urllib2.urlopen("http://www.espn.com/nfl/schedule/_/week/10").read(), "lxml")
    for date in schedule.find_all('h2'):
        #separate by game
        game_info = date.nextSibling.find_all('tr')
        date = str(date).split(">")
        date = date[1].split("<")
        date = date[0]
        #print date
        for i in range(len(game_info)):
            #separate each part of game row
            value = game_info[i].find_all('td')
            #iterate over <thead>
            if len(value) > 1:
                #away team abv
                away = str(value[0].find('abbr')).split(">")
                away = away[1].split("<")
                away = away[0]
                #home team abv
                home = str(value[1].find('abbr')).split(">")
                home = home[1].split("<")
                home = home[0]
                time = value[2].find_all('a')
                print time
                #print "%s at %s" % (away, home)
if __name__ == "__main__":
    parse_nfl_schedule_espn()
Any help/suggestions would be much appreciated.
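The HTML that urllib2 downloads has empty a tags because the times are filled in by JavaScript after the page loads. You will need something like Selenium, which drives a real browser and lets it run the JavaScript before you read the HTML. This can be done as follows: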
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
def parse_nfl_schedule_espn():
    browser = webdriver.Firefox(firefox_binary=FirefoxBinary())
    browser.get("http://www.espn.com/nfl/schedule/_/week/10")
    schedule = BeautifulSoup(browser.page_source, "lxml")
    for date in schedule.find_all('a', attrs={'data-dateformat' : "time1"}):
        print date.text
if __name__ == "__main__":
    parse_nfl_schedule_espn()
Which would display the following:
6:00 PM
6:00 PM
6:00 PM
6:00 PM
6:00 PM
6:00 PM
6:00 PM
6:00 PM
9:05 PM
9:25 PM
9:25 PM
1:30 AM
1:30 AM
You could also investigate "headless" solutions such as PhantomJS to avoid having to see a browser window being displayed.