Any easy way to extract details from an HTM webpage? - python

I am trying to extract the following address from the 10-Q on this webpage and need help getting it to work: https://www.sec.gov/ix?doc=/Archives/edgar/data/1318605/000095017022012936/tsla-20220630.htm
1 Tesla Road
Austin, Texas
URL = f'https://www.sec.gov/ix?doc=/Archives/edgar/data/{cik}/{accessionNumber}/{primaryDocument}'
response = requests.get(URL, headers = headers)
soup = BeautifulSoup(response.content, "html.parser")
soup.find_all('dei:EntityAddressAddressLine1')
Where:
cik = 0001318605
accessionNumber = 000095017022012936
primaryDocument = tsla-20220630.htm

Unfortunately, because I am running this on Databricks, using Selenium isn't an immediate option I can take. However, it does look like this method works!
r = requests.get(f'https://www.sec.gov/Archives/edgar/data/{cik}/{accessionNumber.replace("-", "")}/{accessionNumber}.txt', headers=headers)
raw_10k = r.text
# crude but workable: slice the text between the label cell and the closing span
city = raw_10k.split('Entity Address, City or Town</a></td>\n<td class="text">')[1].split('<span></span>')[0]
print(city)

As you have already realized, the data is served from the https://www.sec.gov/Archives.... site, and you would need something like Selenium to get it from the https://www.sec.gov/ix?doc=/Archives.... viewer.
[The URL I used was https://www.sec.gov/Archives/edgar/data/1318605/000095017022012936/tsla-20220630.htm and I just copied the cookies and headers from my own browser to pass into the request. I tried to open the link in your answer, but I got a NoSuchKey error...]
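Incidentally, since the ix viewer URL just wraps the raw Archives document, one way to sidestep Selenium is to strip the ix?doc= prefix and fetch the underlying document directly; a minimal sketch, reusing the headers from the question:
viewer_url = 'https://www.sec.gov/ix?doc=/Archives/edgar/data/1318605/000095017022012936/tsla-20220630.htm'
# the viewer merely embeds the raw Archives document, so plain requests can fetch it
raw_url = 'https://www.sec.gov' + viewer_url.split('ix?doc=')[1]
response = requests.get(raw_url, headers=headers)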
If you've managed to fetch the HTML containing the 10-Q form, I feel that the simplest way to extract the address would be with CSS selectors:
[s.text for s in soup.select('td *[name^="dei:EntityAddress"]')]
will return ['1 Tesla Road', 'Austin', 'Texas', '78725'] and so, with
print(', '.join([
    s.get_text(strip=True) for s in
    soup.select('p>span *[name^="dei:EntityAddress"]')
    if 'ZipCode' not in s.get('name')  # excludes the zip code
]))
1 Tesla Road, Austin, Texas will be printed.
You can also use
addrsCell = soup.find(attrs={'name':'dei:EntityAddressAddressLine1'})
if addrsCell and addrsCell.find_parent('td'):  # i.e., neither is None
    print(' '.join([
        s.text for s in addrsCell.find_parent('td').select('p')]))
to get 1 Tesla Road Austin, Texas, which is exactly as you formatted it in your question.

Related

Is there a way to separate strings in HTML?

I'm trying to get the address of some companies from WSJ.com. However, I couldn't figure out a reliable way to separate the city from the state/province in the HTML page.
Here's my code and output:
code = "TURN"
url = "https://www.wsj.com/market-data/quotes/{}".format(code)
headers = {'User-Agent':str(ua.random)}
page = requests.get(url, headers = headers)
page.encoding = page.apparent_encoding
pageText = page.text
soup = BeautifulSoup(pageText, 'html.parser')
address = soup.find('div', {"class" : "WSJTheme--contact--bDuH_KYx"}).contents[0]
print(address.contents[2])
Output: <span class="">Montclair New Jersey 07042</span>
I want to get a result like [Montclair, New Jersey]. However, I can't simply split the string on spaces, since inputs like "San Diego California 92130" or "Beijing Beijing 100022" would each require different rules to separate them.
The city, state, and zip are separate strings in the original HTML code; I'm not sure if this helps:
<span class="">
"Montclair"
"New Jersey"
"07042"
</span>
I would suggest grabbing the zip code and then using a library like: https://pypi.org/project/zipcodes/
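A hedged sketch of that approach (the matching() call and its returned fields are taken from the zipcodes package documentation; verify against the installed version):
import zipcodes  # pip install zipcodes

zip_code = "07042"  # the last token of the address span
matches = zipcodes.matching(zip_code)
if matches:
    # field names assumed from the package docs; the state comes back abbreviated
    print(matches[0]["city"], matches[0]["state"])  # e.g. Montclair NJ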
If the HTML really looks like you portrayed it, you can simply split at the quotes.
a = address.contents[2].text
b = a.split('"', 4)  # works only if the quotes are literal characters in the text
city = b[1]
state = b[3]
print(f"{city}, {state}")
output: Montclair, New Jersey
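If instead those quotes are just how the browser inspector renders separate text nodes (a plausible assumption), BeautifulSoup can yield the pieces directly; a minimal sketch:
# assumes city, state, and zip are distinct text nodes inside the span
parts = list(address.contents[2].stripped_strings)
city, state = parts[0], parts[1]
print(f"{city}, {state}")  # Montclair, New Jersey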

Need to extract link and text from the anchor tag using Beautiful Soup

I am working on extracting the link and the text from anchor tags using Beautiful Soup.
Below is the content from which I have to extract the data, i.e. the link and the text:
Mumbai: Vaccination figures surge in private hospitals, stagnate in government centres
Chennai: Martial arts instructor arrested following allegations of sexual assault
Mumbai Metro lines 2A and 7: Here is everything you need to know
**Python code to extract the content from the above.**
@app.get('/indian_express', response_class=HTMLResponse)
async def dna_india(request: Request):
    print("1111111111111111")
    dict = {}
    URL = "https://indianexpress.com/latest-news/"
    page = requests.get(URL)
    soup = BS(page.content, 'html.parser')
    results = soup.find_all('div', class_="nation")
    for results_element in results:
        results_element_1 = soup.find_all('div', class_="title")
        for results_element_2 in results_element_1:
            for results_element_3 in results_element_2:
                print(results_element_3)  # the HTML code above is printed by this line
                print(" ")
                link_element = results_element_3.find_all('a', class_="title", href=True)  # I am getting an empty [] when I try to print here
                # print(link_element)
                # title_elem = results_element_3.find('a')['href']
                # link_element = results_element_3.find('a').contents[0]
                # print(title_elem)
                # print(link_element)
                # for index, (title, link) in enumerate(zip(title_elem, link_element)):
                #     dict[str(title.text)] = str(link['href'])
    json_compatible_item_data = jsonable_encoder(dict)
    return templates.TemplateResponse("display.html", {"request": request, "json_data": json_compatible_item_data})
@app.get('/deccan_chronicle', response_class=HTMLResponse)
async def deccan_chronicle(request: Request):
    dict = {}
    URL = "https://www.news18.com/india/"
    page = requests.get(URL)
    soup = BS(page.content, 'html.parser')
    main_div = soup.find("div", class_="blog-list")
    for i in main_div:
        link_data = i.find("div", class_="blog-list-blog").find("a")
        text_data = link_data.text
        dict[str(text_data)] = str(link_data.attrs['href'])
    json_compatible_item_data = jsonable_encoder(dict)
    return templates.TemplateResponse("display.html", {"request": request, "json_data": json_compatible_item_data})
Please help me out with this code
You can find the main_div tag, which holds all the news records; inside it you can find the articles divs where all the data is defined. Iterating over those articles, the title can be extracted by finding the proper a tag, which contains the title as well as the href:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://indianexpress.com/latest-news/")
soup = BeautifulSoup(res.text, "html.parser")
main_div = soup.find("div", class_="nation")
articles = main_div.find_all("div", class_="articles")
for i in articles:
    href = i.find("div", class_="title").find("a")
    print(href.attrs['href'])
    text_data = href.text
    print(text_data)
Output:
https://indianexpress.com/article/business/banking-and-finance/banks-cant-cite-2018-rbi-circular-to-caution-clients-on-virtual-currencies-7338628/
Banks can’t cite 2018 RBI circular to caution clients on virtual currencies
https://indianexpress.com/article/india/supreme-court-stays-delhi-high-court-order-on-levy-of-igst-on-imported-oxygen-concentrators-for-personal-use-7339478/
Supreme Court stays Delhi High Court order on levy of IGST on imported oxygen concentrators for personal use
...
2nd Method
Don't make it so complex; just observe which tags contain the data. Here I found the main tag main_div, then went for a tag that contains the text as well as the links. You can find both in the h4 tags and iterate over them:
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.news18.com/india/")
soup = BeautifulSoup(res.text, "html.parser")
main_div = soup.find("div", class_="blog-list")
data = main_div.find_all("h4")
for i in data:
    print(i.find("a")['href'])
    print(i.find("a").text)
output:
https://www.news18.com/news/india/2-killed-six-injured-after-portion-of-two-storey-building-collapses-in-varanasi-pm-assures-help-3799610.html
2 Killed, Six Injured After Portion of Two-Storey Building Collapses in Varanasi; PM Assures Help
https://www.news18.com/news/india/dont-compel-citizens-to-move-courts-again-again-follow-national-litigation-policy-hc-tells-centre-3799598.html
Don't Compel Citizens to Move Courts Again & Again, Follow National Litigation Policy, HC Tells Centre
...
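To feed this back into the original endpoint, the same h4 loop can fill the dictionary the template expects; a sketch under the same tag assumptions as the 2nd method:
# hypothetical adaptation of the h4 loop for the endpoint's dict
news = {}
for i in main_div.find_all("h4"):
    a_tag = i.find("a")
    news[a_tag.text.strip()] = a_tag['href']
# news can then go through jsonable_encoder exactly as in the question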

Generating URL for Yahoo news and Bing news with Python and BeautifulSoup

I want to scrape data from Yahoo News and Bing News pages. The data I want to scrape is the headlines or/and the text below the headlines (whatever can be scraped), and the dates (times) when they were posted.
I have written code, but it does not return anything. The problem is with my URL, since I'm getting response 404.
Can you please help me with it?
This is the code for Bing:
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'http://www.bing.com/news/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
And this is for Yahoo:
term = 'usa'
url = 'http://news.search.yahoo.com/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Please help me generate these URLs and understand the logic behind them. I'm still a noob :)
Basically, your URLs are just wrong. The URLs you have to use are the same ones you find in the address bar while using a regular browser. Usually most search engines and aggregators use the q parameter for the search term. Most of the other parameters are usually not required (sometimes they are, e.g. for specifying the result page number, etc.).
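As an aside, requests can assemble and URL-encode the query string itself via the params argument, which avoids hand-building these URLs; a small sketch:
import requests

# requests URL-encodes the term and appends ?q=... automatically
response = requests.get('https://www.bing.com/news/search', params={'q': 'usa'})
print(response.url)  # https://www.bing.com/news/search?q=usa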
Bing
from bs4 import BeautifulSoup
import requests
import re
term = 'usa'
url = 'https://www.bing.com/news/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for news_card in soup.find_all('div', class_="news-card-body"):
    title = news_card.find('a', class_="title").text
    time = news_card.find(
        'span',
        attrs={'aria-label': re.compile(".*ago$")}
    ).text
    print("{} ({})".format(title, time))
Output
Jason Mohammed blitzkrieg sinks USA (17h)
USA Swimming held not liable by California jury in sexual abuse case (1d)
United States 4-1 Canada: USA secure payback in Nations League (1d)
USA always plays the Dalai Lama card in dealing with China, says Chinese Professor (1d)
...
Yahoo
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'https://news.search.yahoo.com/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for news_item in soup.find_all('div', class_='NewsArticle'):
    title = news_item.find('h4').text
    time = news_item.find('span', class_='fc-2nd').text
    # Clean time text
    time = time.replace('·', '').strip()
    print("{} ({})".format(title, time))
Output
USA Baseball will return to Arizona for second Olympic qualifying chance (52 minutes ago)
Prized White Sox prospect Andrew Vaughn wraps up stint with USA Baseball (28 minutes ago)
Mexico defeats USA in extras for Olympic berth (13 hours ago)
...

Links on original webpage missing after parsing with beautiful soup

Please excuse me if my explanation seems elementary. I'm new to both python and beautiful soup.
I'm trying to extract data from the following website :
https://valor.militarytimes.com/award/5?page=1
I want to extract the links that correspond to each of the 24 medal recipients on the page. I can see from the Firefox inspector that they all have the word 'hero' in their links. However, when I use Beautiful Soup to parse the website, these links do not appear.
I have tried using the standard html.parser as well as the html5lib parser, but neither of them shows the links corresponding to these medal recipients.
page = requests.get('https://valor.militarytimes.com/award/5?page=1')
soup = BeautifulSoup(page.text, "html5lib")
for idx, link in enumerate(soup.find_all('a', href=True)):
    print(link)
The above code finds only some of the links on the original website, and in particular, there are no links corresponding to the medal recipients. Even running soup.prettify() shows that these links are not in the parsed text.
I hope to have a simple code that can extract the links for the 24 medal recipients on this website.
If you want to avoid using Selenium, there is a simple way to get the data you require. The page loads the data by sending a POST request to the following URL:
https://valor.militarytimes.com/api/awards/5?page=1
This returns a JSON response, which is then used to populate the page via JavaScript. All you have to do is send the same request using python-requests and then pull the data out of the JSON response.
import requests

r = requests.post('https://valor.militarytimes.com/api/awards/5?page=1')
for item in r.json()['data']:
    name = item['recipient']['name']
    url = 'https://valor.militarytimes.com/hero/' + str(item['recipient']['id'])
    print(name, url)
Output:
EUGENE MCCARLEY https://valor.militarytimes.com/hero/500963
TIMOTHY KEENAN https://valor.militarytimes.com/hero/500962
JOHN THOMPSON https://valor.militarytimes.com/hero/500961
WALTER BORDEN https://valor.militarytimes.com/hero/500941
WILLIAM ROSE https://valor.militarytimes.com/hero/94465
YUKITAKA MIZUTARI https://valor.militarytimes.com/hero/94175
ALBERT MARTIN https://valor.militarytimes.com/hero/92498
FRANCIS CODY https://valor.militarytimes.com/hero/500944
JAMES O'KEEFFE https://valor.militarytimes.com/hero/500943
PHILLIP FLEMING https://valor.militarytimes.com/hero/500942
JOHN WANAMAKER https://valor.militarytimes.com/hero/314466
ROBERT CHILSON https://valor.militarytimes.com/hero/102316
CHRISTOPHER NELMS https://valor.militarytimes.com/hero/89255
SAMUEL BARNETT https://valor.militarytimes.com/hero/71533
ANDREW BYERS https://valor.militarytimes.com/hero/500938
ANDREW RUSSELL https://valor.militarytimes.com/hero/500937
****** CALDWELL https://valor.militarytimes.com/hero/500935
****** WALWRATH https://valor.militarytimes.com/hero/500934
****** MADSEN https://valor.militarytimes.com/hero/500933
****** NELSON https://valor.militarytimes.com/hero/500932
WILLIAM SOUKUP https://valor.militarytimes.com/hero/500931
BENJAMIN WILSON https://valor.militarytimes.com/hero/500930
ANDREW MARCKESANO https://valor.militarytimes.com/hero/500929
WAYNE KUNZ https://valor.militarytimes.com/hero/500927
I have fetched the name as well. You can just get the link if you require only that.
Edit
To get URLs from multiple pages, use this code:
import requests

list_of_urls = []
last_page = 9  # replace this with your last page
for i in range(1, last_page + 1):
    r = requests.post('https://valor.militarytimes.com/api/awards/5?page={}'.format(i))
    for item in r.json()['data']:
        url = 'https://valor.militarytimes.com/hero/' + str(item['recipient']['id'])
        list_of_urls.append(url)
print(list_of_urls)
Output:
['https://valor.militarytimes.com/hero/500963', 'https://valor.militarytimes.com/hero/500962', 'https://valor.militarytimes.com/hero/500961', 'https://valor.militarytimes.com/hero/500941', 'https://valor.militarytimes.com/hero/94465', 'https://valor.militarytimes.com/hero/94175', 'https://valor.militarytimes.com/hero/92498', 'https://valor.militarytimes.com/hero/500944', 'https://valor.militarytimes.com/hero/500943', 'https://valor.militarytimes.com/hero/500942', 'https://valor.militarytimes.com/hero/314466', 'https://valor.militarytimes.com/hero/102316', 'https://valor.militarytimes.com/hero/89255', 'https://valor.militarytimes.com/hero/71533', 'https://valor.militarytimes.com/hero/500938', 'https://valor.militarytimes.com/hero/500937', 'https://valor.militarytimes.com/hero/500935', 'https://valor.militarytimes.com/hero/500934', 'https://valor.militarytimes.com/hero/500933', 'https://valor.militarytimes.com/hero/500932', 'https://valor.militarytimes.com/hero/500931', 'https://valor.militarytimes.com/hero/500930', 'https://valor.militarytimes.com/hero/500929', 'https://valor.militarytimes.com/hero/500927', 'https://valor.militarytimes.com/hero/500926', 'https://valor.militarytimes.com/hero/500925', 'https://valor.militarytimes.com/hero/500924', 'https://valor.militarytimes.com/hero/500923', 'https://valor.militarytimes.com/hero/500922', 'https://valor.militarytimes.com/hero/500921', 'https://valor.militarytimes.com/hero/500920', 'https://valor.militarytimes.com/hero/500919', 'https://valor.militarytimes.com/hero/500918', 'https://valor.militarytimes.com/hero/500917', 'https://valor.militarytimes.com/hero/500916', 'https://valor.militarytimes.com/hero/500915', 'https://valor.militarytimes.com/hero/500914', 'https://valor.militarytimes.com/hero/500913', 'https://valor.militarytimes.com/hero/500912', 'https://valor.militarytimes.com/hero/500911', 'https://valor.militarytimes.com/hero/500910', 'https://valor.militarytimes.com/hero/500909', 'https://valor.militarytimes.com/hero/500908', 'https://valor.militarytimes.com/hero/500907', 'https://valor.militarytimes.com/hero/500906', 'https://valor.militarytimes.com/hero/500905', 'https://valor.militarytimes.com/hero/500904', 'https://valor.militarytimes.com/hero/500903', 'https://valor.militarytimes.com/hero/500902', 'https://valor.militarytimes.com/hero/500901', 'https://valor.militarytimes.com/hero/500900', 'https://valor.militarytimes.com/hero/500899', 'https://valor.militarytimes.com/hero/500898', 'https://valor.militarytimes.com/hero/500897', 'https://valor.militarytimes.com/hero/500896', 'https://valor.militarytimes.com/hero/500895', 'https://valor.militarytimes.com/hero/500894', 'https://valor.militarytimes.com/hero/500893', 'https://valor.militarytimes.com/hero/500892', 'https://valor.militarytimes.com/hero/500891', 'https://valor.militarytimes.com/hero/500890', 'https://valor.militarytimes.com/hero/500889', 'https://valor.militarytimes.com/hero/500888', 'https://valor.militarytimes.com/hero/29160', 'https://valor.militarytimes.com/hero/106931', 'https://valor.militarytimes.com/hero/106375', 'https://valor.militarytimes.com/hero/94936', 'https://valor.militarytimes.com/hero/94928', 'https://valor.militarytimes.com/hero/94927', 'https://valor.militarytimes.com/hero/94926', 'https://valor.militarytimes.com/hero/94923', 'https://valor.militarytimes.com/hero/94777', 'https://valor.militarytimes.com/hero/94769', 'https://valor.militarytimes.com/hero/94711', 'https://valor.militarytimes.com/hero/94644', 
'https://valor.militarytimes.com/hero/94571', 'https://valor.militarytimes.com/hero/94570', 'https://valor.militarytimes.com/hero/94494', 'https://valor.militarytimes.com/hero/94468', 'https://valor.militarytimes.com/hero/94454', 'https://valor.militarytimes.com/hero/94388', 'https://valor.militarytimes.com/hero/94358', 'https://valor.militarytimes.com/hero/94279', 'https://valor.militarytimes.com/hero/94275', 'https://valor.militarytimes.com/hero/94253', 'https://valor.militarytimes.com/hero/94251', 'https://valor.militarytimes.com/hero/94223', 'https://valor.militarytimes.com/hero/94222', 'https://valor.militarytimes.com/hero/94217', 'https://valor.militarytimes.com/hero/94211', 'https://valor.militarytimes.com/hero/94210', 'https://valor.militarytimes.com/hero/94195', 'https://valor.militarytimes.com/hero/94194', 'https://valor.militarytimes.com/hero/94173', 'https://valor.militarytimes.com/hero/94168', 'https://valor.militarytimes.com/hero/94055', 'https://valor.militarytimes.com/hero/93916', 'https://valor.militarytimes.com/hero/93847', 'https://valor.militarytimes.com/hero/93780', 'https://valor.militarytimes.com/hero/93779', 'https://valor.militarytimes.com/hero/93775', 'https://valor.militarytimes.com/hero/93774', 'https://valor.militarytimes.com/hero/93733', 'https://valor.militarytimes.com/hero/93722', 'https://valor.militarytimes.com/hero/93706', 'https://valor.militarytimes.com/hero/93551', 'https://valor.militarytimes.com/hero/93435', 'https://valor.militarytimes.com/hero/93407', 'https://valor.militarytimes.com/hero/93374', 'https://valor.militarytimes.com/hero/93277', 'https://valor.militarytimes.com/hero/93243', 'https://valor.militarytimes.com/hero/93193', 'https://valor.militarytimes.com/hero/92989', 'https://valor.militarytimes.com/hero/92972', 'https://valor.militarytimes.com/hero/92958', 'https://valor.militarytimes.com/hero/93923', 'https://valor.militarytimes.com/hero/90130', 'https://valor.militarytimes.com/hero/90128', 'https://valor.militarytimes.com/hero/89704', 'https://valor.militarytimes.com/hero/89703', 'https://valor.militarytimes.com/hero/89702', 'https://valor.militarytimes.com/hero/89701', 'https://valor.militarytimes.com/hero/89698', 'https://valor.militarytimes.com/hero/89673', 'https://valor.militarytimes.com/hero/89661', 'https://valor.militarytimes.com/hero/90127', 'https://valor.militarytimes.com/hero/89535', 'https://valor.militarytimes.com/hero/89493', 'https://valor.militarytimes.com/hero/89406', 'https://valor.militarytimes.com/hero/89405', 'https://valor.militarytimes.com/hero/89404', 'https://valor.militarytimes.com/hero/89261', 'https://valor.militarytimes.com/hero/89259', 'https://valor.militarytimes.com/hero/88805', 'https://valor.militarytimes.com/hero/88803', 'https://valor.militarytimes.com/hero/88789', 'https://valor.militarytimes.com/hero/88770', 'https://valor.militarytimes.com/hero/88766', 'https://valor.militarytimes.com/hero/88765', 'https://valor.militarytimes.com/hero/88719', 'https://valor.militarytimes.com/hero/88680', 'https://valor.militarytimes.com/hero/88679', 'https://valor.militarytimes.com/hero/88678', 'https://valor.militarytimes.com/hero/88658', 'https://valor.militarytimes.com/hero/88657', 'https://valor.militarytimes.com/hero/88616', 'https://valor.militarytimes.com/hero/88578', 'https://valor.militarytimes.com/hero/88551', 'https://valor.militarytimes.com/hero/88445', 'https://valor.militarytimes.com/hero/88366', 'https://valor.militarytimes.com/hero/88365', 'https://valor.militarytimes.com/hero/88045', 
'https://valor.militarytimes.com/hero/88044', 'https://valor.militarytimes.com/hero/88013', 'https://valor.militarytimes.com/hero/88012', 'https://valor.militarytimes.com/hero/87986', 'https://valor.militarytimes.com/hero/87918', 'https://valor.militarytimes.com/hero/87909', 'https://valor.militarytimes.com/hero/87898', 'https://valor.militarytimes.com/hero/87830', 'https://valor.militarytimes.com/hero/88570', 'https://valor.militarytimes.com/hero/88568', 'https://valor.militarytimes.com/hero/88239', 'https://valor.militarytimes.com/hero/87792', 'https://valor.militarytimes.com/hero/87782', 'https://valor.militarytimes.com/hero/87677', 'https://valor.militarytimes.com/hero/87655', 'https://valor.militarytimes.com/hero/87523', 'https://valor.militarytimes.com/hero/87460', 'https://valor.militarytimes.com/hero/87292', 'https://valor.militarytimes.com/hero/87291', 'https://valor.militarytimes.com/hero/87288', 'https://valor.militarytimes.com/hero/87283', 'https://valor.militarytimes.com/hero/87282', 'https://valor.militarytimes.com/hero/87281', 'https://valor.militarytimes.com/hero/87280', 'https://valor.militarytimes.com/hero/87279', 'https://valor.militarytimes.com/hero/87272', 'https://valor.militarytimes.com/hero/86875', 'https://valor.militarytimes.com/hero/86811', 'https://valor.militarytimes.com/hero/86451', 'https://valor.militarytimes.com/hero/86077', 'https://valor.militarytimes.com/hero/86076', 'https://valor.militarytimes.com/hero/85994', 'https://valor.militarytimes.com/hero/86005', 'https://valor.militarytimes.com/hero/6190', 'https://valor.militarytimes.com/hero/5022', 'https://valor.militarytimes.com/hero/500877', 'https://valor.militarytimes.com/hero/500851', 'https://valor.militarytimes.com/hero/500844', 'https://valor.militarytimes.com/hero/500843', 'https://valor.militarytimes.com/hero/500842', 'https://valor.militarytimes.com/hero/500841', 'https://valor.militarytimes.com/hero/500840', 'https://valor.militarytimes.com/hero/500839', 'https://valor.militarytimes.com/hero/500838', 'https://valor.militarytimes.com/hero/500837', 'https://valor.militarytimes.com/hero/500836', 'https://valor.militarytimes.com/hero/500835', 'https://valor.militarytimes.com/hero/500834', 'https://valor.militarytimes.com/hero/500833', 'https://valor.militarytimes.com/hero/500832', 'https://valor.militarytimes.com/hero/500831', 'https://valor.militarytimes.com/hero/500830', 'https://valor.militarytimes.com/hero/500829', 'https://valor.militarytimes.com/hero/500827', 'https://valor.militarytimes.com/hero/500826', 'https://valor.militarytimes.com/hero/500817', 'https://valor.militarytimes.com/hero/500816', 'https://valor.militarytimes.com/hero/500815', 'https://valor.militarytimes.com/hero/500813', 'https://valor.militarytimes.com/hero/500808', 'https://valor.militarytimes.com/hero/401188', 'https://valor.militarytimes.com/hero/401185', 'https://valor.militarytimes.com/hero/89851', 'https://valor.militarytimes.com/hero/89846']
You can also use Selenium WebDriver together with Beautiful Soup:
from selenium import webdriver
import time
from bs4 import BeautifulSoup

url = 'https://valor.militarytimes.com/award/5?page=1'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('window-size=1920x1080')
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
time.sleep(10)  # give the JavaScript-rendered links time to load
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
items = soup.select('a[href]')  # select() takes a CSS selector; href=True is find_all syntax
hero = []
for item in items:
    if 'hero' in item['href']:
        print(item['href'])
        hero.append(item['href'])
print(hero)
Output:
/hero/500963
/hero/500962
/hero/500961
/hero/500941
/hero/94465
/hero/94175
/hero/92498
/hero/500944
/hero/500943
/hero/500942
/hero/314466
/hero/102316
/hero/89255
/hero/71533
/hero/500938
/hero/500937
/hero/500935
/hero/500934
/hero/500933
/hero/500932
/hero/500931
/hero/500930
/hero/500929
/hero/500927
['/hero/500963', '/hero/500962', '/hero/500961', '/hero/500941', '/hero/94465', '/hero/94175', '/hero/92498', '/hero/500944', '/hero/500943', '/hero/500942', '/hero/314466', '/hero/102316', '/hero/89255', '/hero/71533', '/hero/500938', '/hero/500937', '/hero/500935', '/hero/500934', '/hero/500933', '/hero/500932', '/hero/500931', '/hero/500930', '/hero/500929', '/hero/500927']
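Since the hrefs collected this way are relative, they can be joined to the site root with urljoin from the standard library; a small sketch:
from urllib.parse import urljoin

# hero is the list of relative paths gathered above
full_urls = [urljoin('https://valor.militarytimes.com', h) for h in hero]
# e.g. https://valor.militarytimes.com/hero/500963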
You can make POST requests to the API to retrieve JSON containing the id for each recipient, which you can concatenate onto a base URL to give the full URL for each recipient. The JSON also contains the URL of the last page, so you can determine the end point for a subsequent loop over all pages.
import requests

baseUrl = 'https://valor.militarytimes.com/hero/'
url = 'https://valor.militarytimes.com/api/awards/5?page=1'
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Referer': 'https://valor.militarytimes.com/award/5?page=1',
    'User-Agent': 'Mozilla/5.0'
}
info = requests.post(url, headers=headers, data='').json()
urls = [baseUrl + str(item['recipient']['id']) for item in info['data']]  # page 1
linksInfo = info['links']
firstLink = linksInfo['first']
lastLink = linksInfo['last']
# cast to int so the page-range loop below works
lastPage = int(lastLink.replace('https://valor.militarytimes.com/api/awards/5?page=', ''))
print('last page = {}'.format(lastPage))
print(urls)
While testing retrieval of all results, I noticed you would potentially need to back off and retry.
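A minimal back-off sketch (the helper name and retry counts are illustrative, not part of the site's API):
import time
import requests

def post_with_retry(url, headers, retries=3, backoff=2):
    # hypothetical helper: retry the POST with exponential back-off on failure
    for attempt in range(retries):
        r = requests.post(url, headers=headers, data='')
        if r.ok:
            return r
        time.sleep(backoff ** attempt)
    r.raise_for_status()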
You can build the additional URLs as follows:
if lastPage > 1:
    for page in range(2, lastPage + 1):
        url = 'https://valor.militarytimes.com/api/awards/5?page={}'.format(page)
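A sketch of finishing that loop under the same assumptions, extending the page-1 urls list with every remaining page:
if lastPage > 1:
    for page in range(2, lastPage + 1):
        url = 'https://valor.militarytimes.com/api/awards/5?page={}'.format(page)
        info = post_with_retry(url, headers).json()  # hypothetical helper from above
        urls.extend(baseUrl + str(item['recipient']['id']) for item in info['data'])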

Scrapy or BeautifulSoup to scrape links and text from various websites

I am trying to scrape the links from an inputted URL, but it's only working for one URL (http://www.businessinsider.com). How can it be adapted to scrape from any inputted URL? I am using BeautifulSoup, but is Scrapy better suited for this?
def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title = soup.findAll('a', {'class': 'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href'] + "," + eachtitle.string)
    else:
        print('Please enter another Website...')
You could make a more generic scraper, searching for all tags and all links within those tags. Once you have the list of all links, you can use a regular expression or similar to find the links that match your desired structure.
import requests
from bs4 import BeautifulSoup
import re

response = requests.get('http://www.businessinsider.com')
soup = BeautifulSoup(response.content, 'html.parser')
# find all tags
tags = soup.find_all()
links = []
# iterate over all tags and extract links
for tag in tags:
    # find all href links; Python 3's map() is lazy, so append eagerly instead
    for link in tag.find_all(href=True):
        if link['href']:
            links.append(link['href'])
# example: filter only careerbuilder links
careerbuilder_links = [x for x in links if re.search(r'[w]{3}\.careerbuilder\.com', x)]
code:
import urllib.request
import bs4

def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    title_tags = soup.findAll('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]
    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)

WebScrape()
out:
Where do you want to scrape from today?: http://www.businessinsider.com
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs
