Links on original webpage missing after parsing with beautiful soup - python
Please excuse me if my explanation seems elementary. I'm new to both Python and Beautiful Soup.
I'm trying to extract data from the following website:
https://valor.militarytimes.com/award/5?page=1
I want to extract the links that correspond to each of the 24 medal recipients on the website. I can see from the Firefox inspector that they all have the word 'hero' in their links. However, when I use Beautiful Soup to parse the website, these links do not appear.
I have tried both the standard html.parser and the html5lib parser, but neither shows the links corresponding to these medal recipients.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://valor.militarytimes.com/award/5?page=1')
soup = BeautifulSoup(page.text, "html5lib")
for idx, link in enumerate(soup.find_all('a', href=True)):
    print(link)
The above code finds only some of the links on the original website, and in particular, there are no links corresponding to the medal recipients. Even running soup.prettify() shows that these links are not in the parsed text.
I am hoping for simple code that can extract the links for the 24 medal recipients on this website.
If you want to avoid using Selenium, there is a simple way to get the data you require. The page loads its data by sending a POST request to the following URL:
https://valor.militarytimes.com/api/awards/5?page=1
This returns a JSON response, which is then used to populate the page with JavaScript. All you have to do is send the same request using python-requests and extract the data from the JSON response.
import requests

r = requests.post('https://valor.militarytimes.com/api/awards/5?page=1')
for item in r.json()['data']:
    name = item['recipient']['name']
    url = 'https://valor.militarytimes.com/hero/' + str(item['recipient']['id'])
    print(name, url)
Output:
EUGENE MCCARLEY https://valor.militarytimes.com/hero/500963
TIMOTHY KEENAN https://valor.militarytimes.com/hero/500962
JOHN THOMPSON https://valor.militarytimes.com/hero/500961
WALTER BORDEN https://valor.militarytimes.com/hero/500941
WILLIAM ROSE https://valor.militarytimes.com/hero/94465
YUKITAKA MIZUTARI https://valor.militarytimes.com/hero/94175
ALBERT MARTIN https://valor.militarytimes.com/hero/92498
FRANCIS CODY https://valor.militarytimes.com/hero/500944
JAMES O'KEEFFE https://valor.militarytimes.com/hero/500943
PHILLIP FLEMING https://valor.militarytimes.com/hero/500942
JOHN WANAMAKER https://valor.militarytimes.com/hero/314466
ROBERT CHILSON https://valor.militarytimes.com/hero/102316
CHRISTOPHER NELMS https://valor.militarytimes.com/hero/89255
SAMUEL BARNETT https://valor.militarytimes.com/hero/71533
ANDREW BYERS https://valor.militarytimes.com/hero/500938
ANDREW RUSSELL https://valor.militarytimes.com/hero/500937
****** CALDWELL https://valor.militarytimes.com/hero/500935
****** WALWRATH https://valor.militarytimes.com/hero/500934
****** MADSEN https://valor.militarytimes.com/hero/500933
****** NELSON https://valor.militarytimes.com/hero/500932
WILLIAM SOUKUP https://valor.militarytimes.com/hero/500931
BENJAMIN WILSON https://valor.militarytimes.com/hero/500930
ANDREW MARCKESANO https://valor.militarytimes.com/hero/500929
WAYNE KUNZ https://valor.militarytimes.com/hero/500927
I have fetched the name as well; you can extract just the link if that is all you need.
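For example, if only the links matter, a one-line list comprehension over the same JSON response will do (a minimal sketch, reusing the response r from the snippet above):

# Collect just the recipient URLs from the JSON payload
links = ['https://valor.militarytimes.com/hero/' + str(item['recipient']['id'])
         for item in r.json()['data']]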
Edit
To get the URLs from multiple pages, use this code:
import requests

list_of_urls = []
last_page = 9  # replace this with your last page
for i in range(1, last_page + 1):
    r = requests.post('https://valor.militarytimes.com/api/awards/5?page={}'.format(i))
    for item in r.json()['data']:
        url = 'https://valor.militarytimes.com/hero/' + str(item['recipient']['id'])
        list_of_urls.append(url)
print(list_of_urls)
Output:
['https://valor.militarytimes.com/hero/500963', 'https://valor.militarytimes.com/hero/500962', 'https://valor.militarytimes.com/hero/500961', 'https://valor.militarytimes.com/hero/500941', 'https://valor.militarytimes.com/hero/94465', 'https://valor.militarytimes.com/hero/94175', 'https://valor.militarytimes.com/hero/92498', 'https://valor.militarytimes.com/hero/500944', 'https://valor.militarytimes.com/hero/500943', 'https://valor.militarytimes.com/hero/500942', 'https://valor.militarytimes.com/hero/314466', 'https://valor.militarytimes.com/hero/102316', 'https://valor.militarytimes.com/hero/89255', 'https://valor.militarytimes.com/hero/71533', 'https://valor.militarytimes.com/hero/500938', 'https://valor.militarytimes.com/hero/500937', 'https://valor.militarytimes.com/hero/500935', 'https://valor.militarytimes.com/hero/500934', 'https://valor.militarytimes.com/hero/500933', 'https://valor.militarytimes.com/hero/500932', 'https://valor.militarytimes.com/hero/500931', 'https://valor.militarytimes.com/hero/500930', 'https://valor.militarytimes.com/hero/500929', 'https://valor.militarytimes.com/hero/500927', 'https://valor.militarytimes.com/hero/500926', 'https://valor.militarytimes.com/hero/500925', 'https://valor.militarytimes.com/hero/500924', 'https://valor.militarytimes.com/hero/500923', 'https://valor.militarytimes.com/hero/500922', 'https://valor.militarytimes.com/hero/500921', 'https://valor.militarytimes.com/hero/500920', 'https://valor.militarytimes.com/hero/500919', 'https://valor.militarytimes.com/hero/500918', 'https://valor.militarytimes.com/hero/500917', 'https://valor.militarytimes.com/hero/500916', 'https://valor.militarytimes.com/hero/500915', 'https://valor.militarytimes.com/hero/500914', 'https://valor.militarytimes.com/hero/500913', 'https://valor.militarytimes.com/hero/500912', 'https://valor.militarytimes.com/hero/500911', 'https://valor.militarytimes.com/hero/500910', 'https://valor.militarytimes.com/hero/500909', 'https://valor.militarytimes.com/hero/500908', 'https://valor.militarytimes.com/hero/500907', 'https://valor.militarytimes.com/hero/500906', 'https://valor.militarytimes.com/hero/500905', 'https://valor.militarytimes.com/hero/500904', 'https://valor.militarytimes.com/hero/500903', 'https://valor.militarytimes.com/hero/500902', 'https://valor.militarytimes.com/hero/500901', 'https://valor.militarytimes.com/hero/500900', 'https://valor.militarytimes.com/hero/500899', 'https://valor.militarytimes.com/hero/500898', 'https://valor.militarytimes.com/hero/500897', 'https://valor.militarytimes.com/hero/500896', 'https://valor.militarytimes.com/hero/500895', 'https://valor.militarytimes.com/hero/500894', 'https://valor.militarytimes.com/hero/500893', 'https://valor.militarytimes.com/hero/500892', 'https://valor.militarytimes.com/hero/500891', 'https://valor.militarytimes.com/hero/500890', 'https://valor.militarytimes.com/hero/500889', 'https://valor.militarytimes.com/hero/500888', 'https://valor.militarytimes.com/hero/29160', 'https://valor.militarytimes.com/hero/106931', 'https://valor.militarytimes.com/hero/106375', 'https://valor.militarytimes.com/hero/94936', 'https://valor.militarytimes.com/hero/94928', 'https://valor.militarytimes.com/hero/94927', 'https://valor.militarytimes.com/hero/94926', 'https://valor.militarytimes.com/hero/94923', 'https://valor.militarytimes.com/hero/94777', 'https://valor.militarytimes.com/hero/94769', 'https://valor.militarytimes.com/hero/94711', 'https://valor.militarytimes.com/hero/94644', 
'https://valor.militarytimes.com/hero/94571', 'https://valor.militarytimes.com/hero/94570', 'https://valor.militarytimes.com/hero/94494', 'https://valor.militarytimes.com/hero/94468', 'https://valor.militarytimes.com/hero/94454', 'https://valor.militarytimes.com/hero/94388', 'https://valor.militarytimes.com/hero/94358', 'https://valor.militarytimes.com/hero/94279', 'https://valor.militarytimes.com/hero/94275', 'https://valor.militarytimes.com/hero/94253', 'https://valor.militarytimes.com/hero/94251', 'https://valor.militarytimes.com/hero/94223', 'https://valor.militarytimes.com/hero/94222', 'https://valor.militarytimes.com/hero/94217', 'https://valor.militarytimes.com/hero/94211', 'https://valor.militarytimes.com/hero/94210', 'https://valor.militarytimes.com/hero/94195', 'https://valor.militarytimes.com/hero/94194', 'https://valor.militarytimes.com/hero/94173', 'https://valor.militarytimes.com/hero/94168', 'https://valor.militarytimes.com/hero/94055', 'https://valor.militarytimes.com/hero/93916', 'https://valor.militarytimes.com/hero/93847', 'https://valor.militarytimes.com/hero/93780', 'https://valor.militarytimes.com/hero/93779', 'https://valor.militarytimes.com/hero/93775', 'https://valor.militarytimes.com/hero/93774', 'https://valor.militarytimes.com/hero/93733', 'https://valor.militarytimes.com/hero/93722', 'https://valor.militarytimes.com/hero/93706', 'https://valor.militarytimes.com/hero/93551', 'https://valor.militarytimes.com/hero/93435', 'https://valor.militarytimes.com/hero/93407', 'https://valor.militarytimes.com/hero/93374', 'https://valor.militarytimes.com/hero/93277', 'https://valor.militarytimes.com/hero/93243', 'https://valor.militarytimes.com/hero/93193', 'https://valor.militarytimes.com/hero/92989', 'https://valor.militarytimes.com/hero/92972', 'https://valor.militarytimes.com/hero/92958', 'https://valor.militarytimes.com/hero/93923', 'https://valor.militarytimes.com/hero/90130', 'https://valor.militarytimes.com/hero/90128', 'https://valor.militarytimes.com/hero/89704', 'https://valor.militarytimes.com/hero/89703', 'https://valor.militarytimes.com/hero/89702', 'https://valor.militarytimes.com/hero/89701', 'https://valor.militarytimes.com/hero/89698', 'https://valor.militarytimes.com/hero/89673', 'https://valor.militarytimes.com/hero/89661', 'https://valor.militarytimes.com/hero/90127', 'https://valor.militarytimes.com/hero/89535', 'https://valor.militarytimes.com/hero/89493', 'https://valor.militarytimes.com/hero/89406', 'https://valor.militarytimes.com/hero/89405', 'https://valor.militarytimes.com/hero/89404', 'https://valor.militarytimes.com/hero/89261', 'https://valor.militarytimes.com/hero/89259', 'https://valor.militarytimes.com/hero/88805', 'https://valor.militarytimes.com/hero/88803', 'https://valor.militarytimes.com/hero/88789', 'https://valor.militarytimes.com/hero/88770', 'https://valor.militarytimes.com/hero/88766', 'https://valor.militarytimes.com/hero/88765', 'https://valor.militarytimes.com/hero/88719', 'https://valor.militarytimes.com/hero/88680', 'https://valor.militarytimes.com/hero/88679', 'https://valor.militarytimes.com/hero/88678', 'https://valor.militarytimes.com/hero/88658', 'https://valor.militarytimes.com/hero/88657', 'https://valor.militarytimes.com/hero/88616', 'https://valor.militarytimes.com/hero/88578', 'https://valor.militarytimes.com/hero/88551', 'https://valor.militarytimes.com/hero/88445', 'https://valor.militarytimes.com/hero/88366', 'https://valor.militarytimes.com/hero/88365', 'https://valor.militarytimes.com/hero/88045', 
'https://valor.militarytimes.com/hero/88044', 'https://valor.militarytimes.com/hero/88013', 'https://valor.militarytimes.com/hero/88012', 'https://valor.militarytimes.com/hero/87986', 'https://valor.militarytimes.com/hero/87918', 'https://valor.militarytimes.com/hero/87909', 'https://valor.militarytimes.com/hero/87898', 'https://valor.militarytimes.com/hero/87830', 'https://valor.militarytimes.com/hero/88570', 'https://valor.militarytimes.com/hero/88568', 'https://valor.militarytimes.com/hero/88239', 'https://valor.militarytimes.com/hero/87792', 'https://valor.militarytimes.com/hero/87782', 'https://valor.militarytimes.com/hero/87677', 'https://valor.militarytimes.com/hero/87655', 'https://valor.militarytimes.com/hero/87523', 'https://valor.militarytimes.com/hero/87460', 'https://valor.militarytimes.com/hero/87292', 'https://valor.militarytimes.com/hero/87291', 'https://valor.militarytimes.com/hero/87288', 'https://valor.militarytimes.com/hero/87283', 'https://valor.militarytimes.com/hero/87282', 'https://valor.militarytimes.com/hero/87281', 'https://valor.militarytimes.com/hero/87280', 'https://valor.militarytimes.com/hero/87279', 'https://valor.militarytimes.com/hero/87272', 'https://valor.militarytimes.com/hero/86875', 'https://valor.militarytimes.com/hero/86811', 'https://valor.militarytimes.com/hero/86451', 'https://valor.militarytimes.com/hero/86077', 'https://valor.militarytimes.com/hero/86076', 'https://valor.militarytimes.com/hero/85994', 'https://valor.militarytimes.com/hero/86005', 'https://valor.militarytimes.com/hero/6190', 'https://valor.militarytimes.com/hero/5022', 'https://valor.militarytimes.com/hero/500877', 'https://valor.militarytimes.com/hero/500851', 'https://valor.militarytimes.com/hero/500844', 'https://valor.militarytimes.com/hero/500843', 'https://valor.militarytimes.com/hero/500842', 'https://valor.militarytimes.com/hero/500841', 'https://valor.militarytimes.com/hero/500840', 'https://valor.militarytimes.com/hero/500839', 'https://valor.militarytimes.com/hero/500838', 'https://valor.militarytimes.com/hero/500837', 'https://valor.militarytimes.com/hero/500836', 'https://valor.militarytimes.com/hero/500835', 'https://valor.militarytimes.com/hero/500834', 'https://valor.militarytimes.com/hero/500833', 'https://valor.militarytimes.com/hero/500832', 'https://valor.militarytimes.com/hero/500831', 'https://valor.militarytimes.com/hero/500830', 'https://valor.militarytimes.com/hero/500829', 'https://valor.militarytimes.com/hero/500827', 'https://valor.militarytimes.com/hero/500826', 'https://valor.militarytimes.com/hero/500817', 'https://valor.militarytimes.com/hero/500816', 'https://valor.militarytimes.com/hero/500815', 'https://valor.militarytimes.com/hero/500813', 'https://valor.militarytimes.com/hero/500808', 'https://valor.militarytimes.com/hero/401188', 'https://valor.militarytimes.com/hero/401185', 'https://valor.militarytimes.com/hero/89851', 'https://valor.militarytimes.com/hero/89846']
You can use Selenium WebDriver together with Beautiful Soup:
from selenium import webdriver
import time
from bs4 import BeautifulSoup

url = 'https://valor.militarytimes.com/award/5?page=1'

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('window-size=1920x1080')

driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
time.sleep(10)  # give the JavaScript time to populate the page
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')

items = soup.find_all('a', href=True)  # select() does not accept href=True
hero = []
for item in items:
    if 'hero' in item['href']:
        print(item['href'])
        hero.append(item['href'])
print(hero)
Output:
/hero/500963
/hero/500962
/hero/500961
/hero/500941
/hero/94465
/hero/94175
/hero/92498
/hero/500944
/hero/500943
/hero/500942
/hero/314466
/hero/102316
/hero/89255
/hero/71533
/hero/500938
/hero/500937
/hero/500935
/hero/500934
/hero/500933
/hero/500932
/hero/500931
/hero/500930
/hero/500929
/hero/500927
['/hero/500963', '/hero/500962', '/hero/500961', '/hero/500941', '/hero/94465', '/hero/94175', '/hero/92498', '/hero/500944', '/hero/500943', '/hero/500942', '/hero/314466', '/hero/102316', '/hero/89255', '/hero/71533', '/hero/500938', '/hero/500937', '/hero/500935', '/hero/500934', '/hero/500933', '/hero/500932', '/hero/500931', '/hero/500930', '/hero/500929', '/hero/500927']
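Note that Selenium returns the hrefs as relative paths. If you want absolute URLs, here is a minimal sketch using urljoin (assuming the hero list built above):

from urllib.parse import urljoin

# Resolve the relative /hero/... paths against the site root
base = 'https://valor.militarytimes.com'
full_urls = [urljoin(base, path) for path in hero]
print(full_urls)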
You can make POST requests to the API to retrieve JSON containing the id of each recipient, which you can concatenate onto a base URL to give the full URL for that recipient. The JSON also contains the URL of the last page, so you can determine the end point for a subsequent loop over all pages.
import requests

baseUrl = 'https://valor.militarytimes.com/hero/'
url = 'https://valor.militarytimes.com/api/awards/5?page=1'
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Referer': 'https://valor.militarytimes.com/award/5?page=1',
    'User-Agent': 'Mozilla/5.0'
}

info = requests.post(url, headers=headers, data='').json()
urls = [baseUrl + str(item['recipient']['id']) for item in info['data']]  # page 1

linksInfo = info['links']
lastLink = linksInfo['last']
# The 'last' link ends with the final page number; strip the base to get it as an int
lastPage = int(lastLink.replace('https://valor.militarytimes.com/api/awards/5?page=', ''))
print('last page = {}'.format(lastPage))
print(urls)
While testing retrieval of all results, I noticed you would potentially need to back off and retry.
You can build the additional URLs as follows:
if lastPage > 1:
    for page in range(2, lastPage + 1):
        url = 'https://valor.militarytimes.com/api/awards/5?page={}'.format(page)
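A minimal back-off-and-retry sketch along those lines (the retry count, delay values, and status-code check are my assumptions, not something the API documents; it reuses the headers dict from above):

import time

def get_page(page, retries=3, delay=1):
    # Retry with exponential back-off when the server rejects the request
    for attempt in range(retries):
        r = requests.post('https://valor.militarytimes.com/api/awards/5?page={}'.format(page),
                          headers=headers, data='')
        if r.status_code == 200:
            return r.json()
        time.sleep(delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError('page {} failed after {} retries'.format(page, retries))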
Related
Any easy way to extract details from an HTM webpage?
I am trying to extract the following address from the 10-Q on this webpage and need help getting it to work: https://www.sec.gov/ix?doc=/Archives/edgar/data/1318605/000095017022012936/tsla-20220630.htm

1 Tesla Road
Austin, Texas

URL = f'https://www.sec.gov/ix?doc=/Archives/edgar/data/{cik}/{accessionNumber}/{primaryDocument}'
response = requests.get(URL, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
soup.find_all('dei:EntityAddressAddressLine1')

Where:

cik = 0001318605
accessionNumber = 000095017022012936
primaryDocument = tsla-20220630.htm
Unfortunately, because I am running this on Databricks, using Selenium isn't an immediate solution I can take. However, it does look like this method works!

r = requests.get(f'https://www.sec.gov/Archives/edgar/data/{cik}/{accessionNumber.replace("-", "")}/{accessionNumber}.txt', headers=headers)
raw_10k = r.text
city = raw_10k.split('Entity Address, City or Town</a></td>\n<td class="text">')[1].split('<span></span>')[0]
print(city)
As you have already realized, the data is added from the https://www.sec.gov/Archives.... site, and you would need something like Selenium to get it from the https://www.sec.gov/ix?doc=/Archives.... site. [The URL I used was https://www.sec.gov/Archives/edgar/data/1318605/000095017022012936/tsla-20220630.htm and I just copied the cookies and headers from my own browser to pass into the request. I tried to open the link in your answer, but I got a NoSuchKey error...]

If you've managed to fetch HTML containing the 10-Q form, I feel the simplest way to extract the address would be with CSS selectors:

[s.text for s in soup.select('td *[name^="dei:EntityAddress"]')]

will return ['1 Tesla Road', 'Austin', 'Texas', '78725'], and so with

print(', '.join([
    s.get_text(strip=True) for s in soup.select('p>span *[name^="dei:EntityAddress"]')
    if 'ZipCode' not in s.get('name')  # excludes zipcode
]))

1 Tesla Road, Austin, Texas will be printed. You can also use

addrsCell = soup.find(attrs={'name': 'dei:EntityAddressAddressLine1'})
if addrsCell and addrsCell.find_parent('td'):  # is not None
    print(' '.join([s.text for s in addrsCell.find_parent('td').select('p')]))

to get 1 Tesla Road Austin, Texas, which is exactly as you formatted it in your question.
Web Scraping in Python
I was trying to scrape a website for a university project. The website is https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813. I have a problem with my Python code. What I want to obtain is all the reviews for pages 1 to 5, but instead I get all []. Any help would be appreciated! Here is the code:

import csv
from bs4 import BeautifulSoup
import urllib.request
import re
import pandas as pd
import requests

reviewlist = []

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('https://www.bonprix.it/prodotto/leggings-a-pinocchietto-pacco-da-2-leggings-a-pinocchietto-pacco-da-2-bianco-nero-956015/?itemOptionId=12211813')
soup = BeautifulSoup(response, 'html.parser')

reviews = soup.find_all('div', {'class': 'reviewContent'})
for i in reviews:
    review = {
        'per_review_name': i.find('span', {'itemprop': 'name'}).text.strip(),
        'per_review': i.find('p', {'class': 'reviewText'}).text.strip(),
        'per_review_taglia': i.find('p', {'class': 'singleReviewSizeDescr'}).text.strip(),
    }
    reviewlist.append(review)

for page in range(1, 5):
    prova = soup.find_all('div', {'data-page': '{page}'})
    print(prova)
    print(len(reviewlist))

df = pd.DataFrame(reviewlist)
df.to_csv('list.csv', index=False)
print('Fine.')

And here is the output that I get:

[]
5
[]
5
[]
5
[]
5
Fine.
As I understand it, the site uses JavaScript to load most of its content, therefore you can't scrape that data directly, as it isn't loaded initially. But you can use the rating backend for your product page; the link is:

https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page=1&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters=

You can go through the pages by changing the page parameter in the URL/GET request. The link returns an HTML document of the rating page, and you can get the rating from the ratingValue meta tag.
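A minimal sketch of that idea (the itemprop="ratingValue" meta tag is an assumption about the page's markup, so verify it against the actual response):

import requests
from bs4 import BeautifulSoup

url = ('https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date'
       '&page=1&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true'
       '&xxl=false&variantFilters=')
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Assumed markup: <meta itemprop="ratingValue" content="...">
rating = soup.find('meta', attrs={'itemprop': 'ratingValue'})
if rating:
    print(rating.get('content'))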
The website only loads the first page of the reviews in the first request. If you inspect its requests, you can see that it requests additional data when you change the page of the reviews. You can rewrite your code as follows to get the reviews from all pages:

reviews_dom = []
for page in range(1, 6):
    url = f"https://www.bonprix.it/reviews/list/?styleId=31436999&sortby=date&page={page}&rating=0&variant=0&size=0&bodyHeight=0&showOldReviews=true&xxl=false&variantFilters="
    r = requests.request("GET", url)
    soup = BeautifulSoup(r.text, "html.parser")
    reviews_dom += soup.find_all("div", attrs={"class": "reviewContent"})

reviews = []
for review_item in reviews_dom:
    review = {
        'per_review_name': review_item.find('span', attrs={'itemprop': 'name'}).text.strip(),
        'per_review': review_item.find('p', attrs={'class': 'reviewText'}).text.strip(),
        'per_review_taglia': review_item.find('p', attrs={'class': 'singleReviewSizeDescr'}).text.strip(),
    }
    reviews.append(review)

print(len(reviews))
print(reviews)

What happens in the code? In the first loop, we request the data for each page of reviews (the first 5 pages in the above example). In the second loop, we parse the reviews DOM and extract the data we need.
Generating URLs for Yahoo News and Bing News with Python and BeautifulSoup
I want to scrape data from Yahoo News and Bing News pages. The data that I want to scrape are the headlines or/and the text below the headlines (whatever can be scraped) and the dates (times) when they were posted. I have written code, but it does not return anything. It's a problem with my URL, since I'm getting response 404. Can you please help me with it?

This is the code for Bing:

from bs4 import BeautifulSoup
import requests

term = 'usa'
url = 'http://www.bing.com/news/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

And this is for Yahoo:

term = 'usa'
url = 'http://news.search.yahoo.com/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)

Please help me generate these URLs. What's the logic behind them? I'm still a noob :)
Basically your URLs are just wrong. The URLs that you have to use are the same ones that you find in the address bar while using a regular browser. Usually most search engines and aggregators use the q parameter for the search term. Most of the other parameters are usually not required (sometimes they are, e.g. for specifying the result page number).

Bing:

from bs4 import BeautifulSoup
import requests
import re

term = 'usa'
url = 'https://www.bing.com/news/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for news_card in soup.find_all('div', class_="news-card-body"):
    title = news_card.find('a', class_="title").text
    time = news_card.find(
        'span',
        attrs={'aria-label': re.compile(".*ago$")}
    ).text
    print("{} ({})".format(title, time))

Output:

Jason Mohammed blitzkrieg sinks USA (17h)
USA Swimming held not liable by California jury in sexual abuse case (1d)
United States 4-1 Canada: USA secure payback in Nations League (1d)
USA always plays the Dalai Lama card in dealing with China, says Chinese Professor (1d)
...

Yahoo:

from bs4 import BeautifulSoup
import requests

term = 'usa'
url = 'https://news.search.yahoo.com/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for news_item in soup.find_all('div', class_='NewsArticle'):
    title = news_item.find('h4').text
    time = news_item.find('span', class_='fc-2nd').text
    # Clean time text
    time = time.replace('·', '').strip()
    print("{} ({})".format(title, time))

Output:

USA Baseball will return to Arizona for second Olympic qualifying chance (52 minutes ago)
Prized White Sox prospect Andrew Vaughn wraps up stint with USA Baseball (28 minutes ago)
Mexico defeats USA in extras for Olympic berth (13 hours ago)
...
Dynamic Web scraping
I am trying to scrape this page ("http://www.arohan.in/branch-locator.php") in which, when I select the state and city, an address is displayed, and I have to write the state, city and address to a CSV/Excel file. I am able to reach this step, but now I am stuck. Here is my code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait

chrome_path = r"C:\Users\IBM_ADMIN\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.arohan.in/branch-locator.php")

select = Select(driver.find_element_by_name('state'))
select.select_by_visible_text('Bihar')
drop = Select(driver.find_element_by_name('branch'))
city_option = WebDriverWait(driver, 5).until(
    lambda x: x.find_element_by_xpath("//select[@id='city1']/option[text()='Gaya']"))
city_option.click()
Is Selenium necessary? It looks like you can use URLs to arrive at what you want: http://www.arohan.in/branch-locator.php?state=Assam&branch=Mirza. Get a list of the state/branch combinations, then use Beautiful Soup to get the info from each page.
In a slightly organized manner:

import requests
from bs4 import BeautifulSoup

link = "http://www.arohan.in/branch-locator.php?"

def get_links(session, url, payload):
    session.headers["User-Agent"] = "Mozilla/5.0"
    res = session.get(url, params=payload)
    soup = BeautifulSoup(res.text, "lxml")
    item = [item.text for item in soup.select(".address_area p")]
    print(item)

if __name__ == '__main__':
    for st, br in zip(['Bihar', 'West Bengal'], ['Gaya', 'Kolkata']):
        payload = {'state': st, 'branch': br}
        with requests.Session() as session:
            get_links(session, link, payload)

Output:

['Branch', 'House no -10/12, Ward-18, Holding No-12, Swarajpuri Road, Near Bank of Baroda, Gaya Pin 823001(Bihar)', 'N/A', 'N/A']
['Head Office', 'PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091', '+91 33 40156000', 'contact#arohan.in']
A better approach would be to avoid using Selenium. That is useful only if you require the JavaScript processing needed to render the HTML; in your case, the required information is already contained within the HTML.

What is needed is to first make a request to get a page containing all of the states. Then for each state, request the list of branches. Then for each state/branch combination, a URL request can be made to get the HTML containing the address. This happens to be contained in the second <li> entry following a <ul class='address_area'> entry:

from bs4 import BeautifulSoup
import requests
import csv
import time

# Get a list of available states
r = requests.get('http://www.arohan.in/branch-locator.php')
soup = BeautifulSoup(r.text, 'html.parser')
state_select = soup.find('select', id='state1')
states = [option.text for option in state_select.find_all('option')[1:]]

# Open an output CSV file
with open('branch addresses.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['State', 'Branch', 'Address'])

    # For each state determine the available branches
    for state in states:
        r_branches = requests.post('http://www.arohan.in/Ajax/ajax_branch.php', data={'ajax_state': state})
        soup = BeautifulSoup(r_branches.text, 'html.parser')

        # For each branch, request a page containing the address
        for option in soup.find_all('option')[1:]:
            time.sleep(0.5)  # Reduce server loading
            branch = option.text
            print("{}, {}".format(state, branch))
            r_branch = requests.get('http://www.arohan.in/branch-locator.php', params={'state': state, 'branch': branch})
            soup_branch = BeautifulSoup(r_branch.text, 'html.parser')
            ul = soup_branch.find('ul', class_='address_area')
            if ul:
                address = ul.find_all('li')[1].get_text(strip=True)
                row = [state, branch, address]
                csv_output.writerow(row)
            else:
                print(soup_branch.title)

Giving you an output CSV file starting:

State,Branch,Address
West Bengal,Kolkata,"PTI Building, 4th Floor,DP Block, DP-9, Salt Lake CityCalcutta, 700091"
West Bengal,Maheshtala,"Narmada Park, Par Bangla,Baddir Bandh Bus Stop,Opp Lane Kismat Nungi Road,Maheshtala,Kolkata- 700140. (W.B)"
West Bengal,ShyamBazar,"First Floor, 6 F.b.T. Road,Ward No.-6,Kolkata-700002"

You should slow the script down using time.sleep(0.5) to avoid too much load on the server.

Note: [1:] is used because the first item in the drop-down lists is not a branch or state, but a Select Branch entry.
How to scrape content from links stored in a list using Beautiful Soup?
I want to scrape the title of every post on a main website. main is a list which contains 6 or 7 URLs in it:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://forums.oneplus.com/")
s = BeautifulSoup(r.content)
links = s.find_all("a", {"class": "focus-content"})
url2 = []
for link in links:
    url2.append(link.get("href"))
url1 = "https://forums.oneplus.com/"
for u in url2:
    main = url1 + u
    print(main)
for m in main:
    r1 = requests.get(m)
    s1 = BeautifulSoup(r1)
    title = s1.find("span", {"class": "title"})
    print(title)
You need to declare your variable main as a list. In your code you update the variable main in each iteration of the loop; at the end, main will be a string containing the last URL from the url2 list. If you then supply main to the next loop, it will iterate over its individual characters. After a few cosmetic changes, this should get your titles:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://forums.oneplus.com/")
s = BeautifulSoup(r.content, 'lxml')
links = s.find_all("a", {"class": "focus-content"})

url2 = []
for link in links:
    url2.append(link.get("href"))

url1 = "https://forums.oneplus.com/"
main = []
for u in url2:
    main.append(url1 + u)

for m in main:
    r1 = requests.get(m)
    s1 = BeautifulSoup(r1.text, 'lxml')
    title = s1.find("span", {"class": "title"})
    print(title.text.strip())

Prints:

Weekly 240: We release the updates and get reading
Shot on OnePlus: Part 6 – Best Slo-mo Video / Animal Photos
Android P Beta Developer Preview 3 for OnePlus 6
[Let's Talk] To whom are you gonna give your appreciation in this Community?
[Let's Talk] What Does Your OnePlus Device Replace?
[Let's Talk] Loyalty to tech companies