How to get an href's URL using BeautifulSoup? - python

Could you tell me how you get href's URL in the case below? I would appreciate it if you could kindly show your code to get URLs from href.
import requests as r
from bs4 import BeautifulSoup
url = r.get('https://jen.jiji.com/')
soup = BeautifulSoup(url.text, "html.parser")
for el in soup.find_all('li', attrs={'class': 'arrow03'}):
    want = el.find_all('a')
    link = el.get('href', 'no URLs')
    print(want)
    print(link)
Currently, I can get the tags through print(want), but not the URLs through print(link).
[(Update) Japan Daily COVID-19 Cases Halve to 96,000]
no URLs
[2 Universities to Be Merged as "Institute of Science Tokyo"]
no URLs
[(Update) Kishida Cabinet Approval Hits Record-Low 26.5 Pct: Jiji Poll]
no URLs
[Japan Team Develops Black Sheet Absorbing 99.98 Pct of Visible Light]
no URLs
[Tokyo Confirms 7,719 New COVID-19 Cases]
no URLs
[Solaseed Air to Fly Pokemon Jet from March]
no URLs
[Coinbase Announces Halt of Japan Operations]
no URLs
[Japan 2022 Used Auto Sales Hit Record Low]
no URLs
[Osaka Gas Signs LNG Deal with New Sakhalin-2 Operator]
no URLs
[5 Japan Drug Wholesalers to Be Fined for Bid-Rigging]
no URLs
[Foreigners Turn Net Buyers of Japan Stocks Last Week]
no URLs
[Dollar Weaker around 128.20 Yen in Late Tokyo]
no URLs
[OSE Nikkei 225 Futures (Closing)]
no URLs
[Tokyo Stocks Slide on Yen's Bounce, Wall Street Drop]
no URLs
[Nikkei Average/TOPIX Index (Closing)]
no URLs
Please give me solutions or comments.

You are close to your goal, but you have to iterate over the ResultSet in want to get the href. I would recommend selecting your elements in an alternative way with CSS selectors:
for el in soup.select('li.arrow03 a'):
    link = 'https://jen.jiji.com' + el.get('href')
    print(link)
Because the links are relative, you have to prepend https://jen.jiji.com.
Output
https://jen.jiji.com/jc/eng?g=eco&k=2023011901044
https://jen.jiji.com/jc/eng?g=eco&k=2023011900863
https://jen.jiji.com/jc/eng?g=eco&k=2023011900878
https://jen.jiji.com/jc/eng?g=eco&k=2023011900770
https://jen.jiji.com/jc/eng?g=eco&k=2023011900850
https://jen.jiji.com/jc/eng?g=ind&k=2023011900956
https://jen.jiji.com/jc/eng?g=ind&k=2023011801016
https://jen.jiji.com/jc/eng?g=ind&k=2023011800979
https://jen.jiji.com/jc/eng?g=ind&k=2023011700996
https://jen.jiji.com/jc/eng?g=ind&k=2023011700892
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900934
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900924
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900766
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900758
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900697
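For completeness, if you want to keep your original find_all() structure, you can iterate the ResultSet in want and read the href from each a tag. A minimal sketch of that approach (using urljoin instead of manually prepending the base URL):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

resp = requests.get('https://jen.jiji.com/')
soup = BeautifulSoup(resp.text, "html.parser")

for el in soup.find_all('li', attrs={'class': 'arrow03'}):
    # el is the <li>; the href lives on the <a> tags inside it,
    # so iterate the ResultSet returned by find_all('a')
    for a in el.find_all('a'):
        link = urljoin('https://jen.jiji.com/', a.get('href', ''))
        print(a.text, link)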


Need to extract link and text from the anchor tag using Beautiful Soup

I am working on extracting the link and text from the anchor tag using Beautiful Soup.
The HTML below is where I have to extract the data from the anchor tag, which is the link and the text:
Mumbai: Vaccination figures surge in private hospitals, stagnate in government centres
Chennai: Martial arts instructor arrested following allegations of sexual assault
Mumbai Metro lines 2A and 7: Here is everything you need to know
Python code to extract the content from the above HTML:
@app.get('/indian_express', response_class=HTMLResponse)
async def dna_india(request: Request):
    print("1111111111111111")
    dict = {}
    URL = "https://indianexpress.com/latest-news/"
    page = requests.get(URL)
    soup = BS(page.content, 'html.parser')
    results = soup.find_all('div', class_="nation")
    for results_element in results:
        results_element_1 = soup.find_all('div', class_="title")
        for results_element_2 in results_element_1:
            for results_element_3 in results_element_2:
                print(results_element_3)  # this print produces the HTML output shown above
                print(" ")
                link_element = results_element_3.find_all('a', class_="title", href=True)  # I am getting an empty [] when I try to print here
                # print(link_element)
                # title_elem = results_element_3.find('a')['href']
                # link_element = results_element_3.find('a').contents[0]
                # print(title_elem)
                # print(link_element)
                # for index, (title, link) in enumerate(zip(title_elem, link_element)):
                #     dict[str(title.text)] = str(link['href'])
    json_compatible_item_data = jsonable_encoder(dict)
    return templates.TemplateResponse("display.html", {"request": request, "json_data": json_compatible_item_data})

@app.get('/deccan_chronicle', response_class=HTMLResponse)
async def deccan_chronicle(request: Request):
    dict = {}
    URL = "https://www.news18.com/india/"
    page = requests.get(URL)
    soup = BS(page.content, 'html.parser')
    main_div = soup.find("div", class_="blog-list")
    for i in main_div:
        # link_data = i.find("div", class_="blog-list-blog").find("a")
        link_data = i.find("div", class_="blog-list-blog").find("a")
        text_data = link_data.text
        dict[str(text_data)] = str(link_data.attrs['href'])
    json_compatible_item_data = jsonable_encoder(dict)
    return templates.TemplateResponse("display.html", {"request": request, "json_data": json_compatible_item_data})
Please help me out with this code
You can find the main div tag that holds all the news records, then find the articles inside it where the data is defined. Iterating over those articles, the title can be extracted by finding the proper a tag, which contains the title as well as the href.
import requests
from bs4 import BeautifulSoup

res = requests.get("https://indianexpress.com/latest-news/")
soup = BeautifulSoup(res.text, "html.parser")
main_div = soup.find("div", class_="nation")
articles = main_div.find_all("div", class_="articles")
for i in articles:
    href = i.find("div", class_="title").find("a")
    print(href.attrs['href'])
    text_data = href.text
    print(text_data)
Output:
https://indianexpress.com/article/business/banking-and-finance/banks-cant-cite-2018-rbi-circular-to-caution-clients-on-virtual-currencies-7338628/
Banks can’t cite 2018 RBI circular to caution clients on virtual currencies
https://indianexpress.com/article/india/supreme-court-stays-delhi-high-court-order-on-levy-of-igst-on-imported-oxygen-concentrators-for-personal-use-7339478/
Supreme Court stays Delhi High Court order on levy of IGST on imported oxygen concentrators for personal use
...
2nd Method
Don't make it so complex; just observe which tags contain the data you want. Here I found the main tag main_div, then went for the tag that contains both the text and the links; you can find it in the h4 tag and iterate over it.
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.news18.com/india/")
soup = BeautifulSoup(res.text, "html.parser")
main_div = soup.find("div", class_="blog-list")
data = main_div.find_all("h4")
for i in data:
    print(i.find("a")['href'])
    print(i.find("a").text)
output:
https://www.news18.com/news/india/2-killed-six-injured-after-portion-of-two-storey-building-collapses-in-varanasi-pm-assures-help-3799610.html
2 Killed, Six Injured After Portion of Two-Storey Building Collapses in Varanasi; PM Assures Help
https://www.news18.com/news/india/dont-compel-citizens-to-move-courts-again-again-follow-national-litigation-policy-hc-tells-centre-3799598.html
Don't Compel Citizens to Move Courts Again & Again, Follow National Litigation Policy, HC Tells Centre
...
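If you want to plug this back into your FastAPI route, a rough sketch could look as follows; it assumes the app, HTMLResponse, Request, templates and jsonable_encoder objects from your question are already imported and set up:

import requests
from bs4 import BeautifulSoup

@app.get('/deccan_chronicle', response_class=HTMLResponse)
async def deccan_chronicle(request: Request):
    news = {}
    res = requests.get("https://www.news18.com/india/")
    soup = BeautifulSoup(res.text, "html.parser")
    main_div = soup.find("div", class_="blog-list")
    for h4 in main_div.find_all("h4"):
        a = h4.find("a")
        news[a.text] = a['href']  # map title -> link
    json_compatible_item_data = jsonable_encoder(news)
    return templates.TemplateResponse("display.html", {"request": request, "json_data": json_compatible_item_data})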

How to extract links from a website and extract their content in web scraping using Python

I am trying to extract data from a website using the BeautifulSoup and requests packages, where I want to extract the links and their contents.
So far I am able to extract the list of the links that exist on a defined URL, but I do not know how to enter each link and extract the text.
The image below describes my problem: the text and the image are both the link to the whole article.
code:
import requests
from bs4 import BeautifulSoup

url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features="lxml")
print(soup.prettify())

# scraping html tags such as Title, Links, Publication date
news = soup.select('div#listingDiv44083 div.article')  # note: this selection was missing from the posted code; the selector is taken from the answer below
for index, new in enumerate(news):
    published_date = new.find('span', class_="article__time-stamp").text
    title = new.find('h3', class_="article__title").text
    link = new.find('a', class_="article__link").attrs['href']
    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")
result :
publish_date:
06-10-2020 | 20:53
title:
18 killed in bombing in Turkish-controlled Syrian town
link: https://www.annahar.com/english/section/186-mena/06102020061027020
My question is how to continue from here in order to enter each link and extract its content?
The expected result:
publish_date:
06-10-2020 | 20:53
title:
18 killed in bombing in Turkish-controlled Syrian town
link: https://www.annahar.com/english/section/186-mena/06102020061027020
description:
ANKARA: An explosives-laden truck ignited Tuesday on a busy street in a northern #Syrian town controlled by #Turkey-backed opposition fighters, killing at least 18 people and wounding dozens, Syrian opposition activists reported.
The blast in the town of al-Bab took place near a bus station where people often gather to travel from one region to another, according to the opposition’s Civil Defense, also known as White Helmets.
where the description exists inside the linked article page.
Add an additional request to your loop that gets the article page, and grab the description there:
page = requests.get(link)
soup = BeautifulSoup(page.content, features = "lxml")
description = soup.select_one('div.articleMainText').get_text()
print(f" description: {description}")
Example
import requests
from bs4 import BeautifulSoup

url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features="lxml")
# print(soup.prettify())

# scraping html tags such as Title, Links, Publication date
for index, new in enumerate(soup.select('div#listingDiv44083 div.article')):
    published_date = new.find('span', class_="article__time-stamp").get_text(strip=True)
    title = new.find('h3', class_="article__title").get_text(strip=True)
    link = new.find('a', class_="article__link").attrs['href']

    # follow the article link and grab the description from the article page
    page = requests.get(link)
    soup = BeautifulSoup(page.content, features="lxml")
    description = soup.select_one('div.articleMainText').get_text()

    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")
    print(f" description: {description}", '\n')
You have to grab all follow links to the articles and then loop over that and grab the parts you're interested in.
Here's how:
import time

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.annahar.com/english/section/186-mena").content,
    "lxml"
)

follow_links = [
    a["href"] for a in
    soup.find_all("a", class_="article__link")
    if "#" not in a["href"]
]

for link in follow_links:
    s = BeautifulSoup(requests.get(link).content, "lxml")
    date_published = s.find("span", class_="date").getText(strip=True)
    title = s.find("h1", class_="article-main__title").getText(strip=True)
    article_body = s.find("div", {"id": "bodyToAddTags"}).getText()
    print(f"{date_published} {title}\n\n{article_body}\n", "-" * 80)
    time.sleep(2)
Output (shortened for brevity):
08-10-2020 | 12:35 Iran frees rights activist after more than 8 years in prison
TEHRAN: Iran has released a prominent human rights activist who campaigned against the death penalty, Iranian media reported Thursday.The semiofficial ISNA news agency quoted judiciary official Sadegh Niaraki as saying that Narges Mohammadi was freed late Wednesday after serving 8 1/2 years in prison. She was sentenced to 10 years in 2016 while already incarcerated.Niaraki said Mohammadi was released based on a law that allows a prison sentence to be commutated if the related court agrees.In July, rights group Amnesty International demanded Mohammadi’s immediate release because of serious pre-existing health conditions and showing suspected COVID-19 symptoms. The Thursday report did not refer to her possible illness.Mohammadi was sentenced in Tehran’s Revolutionary Court on charges including planning crimes to harm the security of Iran, spreading propaganda against the government and forming and managing an illegal group.She was in a prison in the northwestern city of Zanjan, some 280 kilometers (174 miles) northwest of the capital Tehran.Mohammadi was close to Iranian Nobel Peace Prize laureate Shirin Ebadi, who founded the banned Defenders of Human Rights Center. Ebadi left Iran after the disputed re-election of then-President Mahmoud Ahmadinejad in 2009, which touched off unprecedented protests and harsh crackdowns by authorities.In 2018, Mohammadi, an engineer and physicist, was awarded the 2018 Andrei Sakharov Prize, which recognizes outstanding leadership or achievements of scientists in upholding human rights.
--------------------------------------------------------------------------------
...

Beautiful soup: Extract text and urls from a list, but only under specific headings

I am looking to use Beautiful Soup to scrape the Fujitsu news update page: https://www.fujitsu.com/uk/news/pr/2020/
I only want to extract the information under the headings of the current month and previous month.
For a particular month (e.g. November), I am trying to extract into a list
the Title
the URL
the text
for each news briefing (so a list of lists).
My attempt so far is as follows (showing only the previous month for simplicity):
import calendar
import datetime

import requests
from bs4 import BeautifulSoup

today = datetime.datetime.today()
year_str = str(today.year)
current_m = today.month
previous_m = current_m - 1
current_m_str = calendar.month_name[current_m]
previous_m_str = calendar.month_name[previous_m]

URL = 'https://www.fujitsu.com/uk/news/pr/' + year_str + '/'
resp = requests.get(URL)
soup = BeautifulSoup(resp.text, 'lxml')

previous_m_body = soup.find('h3', text=previous_m_str)
if previous_m_body is not None:
    for sib in previous_m_body.find_next_siblings():
        if sib.name == "h3":
            break
        else:
            previous_m_text = str(sib.text)
            print(previous_m_text)
However, this generates one long string with newlines and no separation between title, text, and URL:
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution London, United Kingdom, November 30, 2020 - Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local...
Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive Loads Hoofddorp, EMEA, November 11, 2020 - Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring.......
I have attached an image of the page DOM.
Try this:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.fujitsu.com/uk/news/pr/2020/").text
all_lists = BeautifulSoup(html, "html.parser").find_all("ul", class_="filterlist")

news = []
for unordered_list in all_lists:
    for list_item in unordered_list.find_all("li"):
        news.append(
            [
                list_item.find("a").getText(),
                f"https://www.fujitsu.com{list_item.find('a')['href']}",
                list_item.getText(strip=True)[len(list_item.find("a").getText()):],
            ]
        )

for news_item in news:
    print("\n".join(news_item))
    print("-" * 80)
Output (shortened for brevity):
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution
https://www.fujitsu.com/uk/news/pr/2020/fs-20201130.html
London, United Kingdom, November 30, 2020- Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local Authorities to support the electronic counting (e-Counting) of ballot papers at the Scottish Local Government elections in May 2022.Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive LoadsHoofddorp, EMEA, November 11, 2020- Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring a switching capacity of 50A at 14VDC. The FBR53-HC offers a higher contact rating than its 40A FBR53-HW counterpart, yet occupies the same 12.1 x 15.5 x 13.7mm footprint and weighs the same 6g.
--------------------------------------------------------------------------------
and more ...
EDIT:
To get just the last two months, all you need is the first two ul items from the soup. So, add [:2] to the first for loop, like this:
for unordered_list in all_lists[:2]:
    # the rest of the loop body goes here
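If you also want to keep the month heading next to each list, one possible sketch pairs the h3 headings with the filterlist uls using zip and keeps only the first two; this assumes each ul.filterlist follows its month's h3 in document order, which is how the page appears to be laid out:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.fujitsu.com/uk/news/pr/2020/").text
soup = BeautifulSoup(html, "html.parser")

headings = soup.find_all("h3")                    # month names, e.g. "November"
lists = soup.find_all("ul", class_="filterlist")  # the briefings below each month

# keep only the two most recent months, pairing each heading with its list
for heading, unordered_list in list(zip(headings, lists))[:2]:
    print(heading.getText(strip=True))
    for list_item in unordered_list.find_all("li"):
        print(" -", list_item.find("a").getText())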
Here I modified your code. I combined your bs4 code with Selenium. Selenium is very powerful for scraping dynamic or JavaScript-based websites, and you can use it together with BeautifulSoup to make your life easier. Now it will give you output for all months.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.maximize_window()
url = "https://www.fujitsu.com/uk/news/pr/2020/"  # change the url if you want to get results for a different year
driver.get(url)

# now your bs4 code starts; it will give you output from the current month back through all previous months
soup = BeautifulSoup(driver.page_source, "html.parser")

# here I am getting all month names from January to November
months = soup.find_all('h3')
for month in months:
    month = month.text
    print(f"month_name : {month}\n")

# here we are getting all description text from the current month back through all previous months
description_texts = soup.find_all('ul', class_='filterlist')
for description_text in description_texts:
    description_texts = description_text.text.replace('\n', '')
    print(f"description_text: {description_texts}")

Generating URL for Yahoo news and Bing news with Python and BeautifulSoup

I want to scrape data from the Yahoo News and Bing News pages. The data that I want to scrape are the headlines and/or the text below the headlines (whatever can be scraped) and the dates (times) when they were posted.
I have written some code but it does not return anything. The problem is with my URLs, since I'm getting response 404.
Can you please help me with it?
This is the code for Bing:
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'http://www.bing.com/news/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
And this is for Yahoo:
term = 'usa'
url = 'http://news.search.yahoo.com/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Please help me generate these URLs and explain the logic behind them; I'm still a noob :)
Basically, your URLs are just wrong. The URLs you have to use are the same ones that you see in the address bar while using a regular browser. Most search engines and aggregators use the q parameter for the search term. Most of the other parameters are usually not required (sometimes they are, e.g. for specifying the result page number).
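As a side note, requests can build and URL-encode that query string for you if you pass the search term through the params argument, so you don't have to format the URL by hand; a minimal sketch:

import requests

term = 'usa'
# requests appends ?q=usa to the base URL and handles the encoding
response = requests.get('https://www.bing.com/news/search', params={'q': term})
print(response.url)          # https://www.bing.com/news/search?q=usa
print(response.status_code)  # should now be 200 instead of 404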
Bing
from bs4 import BeautifulSoup
import requests
import re

term = 'usa'
url = 'https://www.bing.com/news/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for news_card in soup.find_all('div', class_="news-card-body"):
    title = news_card.find('a', class_="title").text
    time = news_card.find(
        'span',
        attrs={'aria-label': re.compile(".*ago$")}
    ).text
    print("{} ({})".format(title, time))
Output
Jason Mohammed blitzkrieg sinks USA (17h)
USA Swimming held not liable by California jury in sexual abuse case (1d)
United States 4-1 Canada: USA secure payback in Nations League (1d)
USA always plays the Dalai Lama card in dealing with China, says Chinese Professor (1d)
...
Yahoo
from bs4 import BeautifulSoup
import requests

term = 'usa'
url = 'https://news.search.yahoo.com/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

for news_item in soup.find_all('div', class_='NewsArticle'):
    title = news_item.find('h4').text
    time = news_item.find('span', class_='fc-2nd').text
    # Clean time text
    time = time.replace('·', '').strip()
    print("{} ({})".format(title, time))
Output
USA Baseball will return to Arizona for second Olympic qualifying chance (52 minutes ago)
Prized White Sox prospect Andrew Vaughn wraps up stint with USA Baseball (28 minutes ago)
Mexico defeats USA in extras for Olympic berth (13 hours ago)
...

Scrapy or BeautifulSoup to scrape links and text from various websites

I am trying to scrape the links from an inputted URL, but it's only working for one URL (http://www.businessinsider.com). How can it be adapted to scrape from any URL inputted? I am using BeautifulSoup, but is Scrapy better suited for this?
import urllib.request

from bs4 import BeautifulSoup


def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title = soup.findAll('a', {'class': 'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href'] + "," + eachtitle.string)
    else:
        print('Please enter another Website...')
You could make a more generic scraper, searching for all tags and all links within those tags. Once you have the list of all links, you can use a regular expression or similar to find the links that match your desired structure.
import requests
from bs4 import BeautifulSoup
import re

response = requests.get('http://www.businessinsider.com')
soup = BeautifulSoup(response.content, 'html.parser')

# find all tags
tags = soup.find_all()

links = []
# iterate over all tags and extract links
for tag in tags:
    # find all href links inside the tag and append them to the master links list
    for linked in tag.find_all(href=True):
        if linked['href']:
            links.append(linked['href'])

# example: filter only careerbuilder links
careerbuilder_links = [link for link in links if re.search(r'[w]{3}\.careerbuilder\.com', link)]
code:
import urllib.request

import bs4


def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    title_tags = soup.findAll('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]
    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)

WebScrape()
out:
Where do you want to scrape from today?: http://www.businessinsider.com
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs
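As a more generic variant (not taken from either answer above), you can drop the site-specific title class, collect every anchor that carries an href, and absolutize relative links with urljoin so the same function works for other sites too:

import urllib.request
from urllib.parse import urljoin

import bs4


def scrape_links(url):
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    # every anchor that actually carries an href, made absolute against the page URL
    return [(urljoin(url, a['href']), a.get_text(strip=True))
            for a in soup.find_all('a', href=True)]


for href, text in scrape_links('http://www.businessinsider.com'):
    print(href, text)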
