Scrapy or BeautifulSoup to scrape links and text from various websites - python

I am trying to scrape the links from a URL that the user inputs, but it only works for one URL (http://www.businessinsider.com). How can it be adapted to scrape any URL that is entered? I am using BeautifulSoup, but is Scrapy better suited for this?
def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title=soup.findAll('a',{'class':'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href']+","+eachtitle.string)
    else:
        print('Please enter another Website...')

You could make a more generic scraper that collects every link on the page, whatever tag it sits in. Once you have the list of all links, you can use a regular expression or similar to find the links that match your desired structure.
import re
import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.businessinsider.com')
# name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(response.content, 'html.parser')
# find every tag that carries an href attribute and collect the non-empty links
# (a list comprehension runs eagerly; in Python 3 a bare map() call is lazy and would do nothing)
links = [tag['href'] for tag in soup.find_all(href=True) if tag['href']]
# example: keep only careerbuilder links
careerbuilder_links = [x for x in links if re.search(r'[w]{3}\.careerbuilder\.com', x)]
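The filtering step can also be done without a regular expression, by parsing each link and comparing its host name. This is only a minimal sketch using the standard library; the helper name links_for_host and the sample links are hypothetical and not part of the answer above. Unlike the www-only regular expression, it also accepts the bare domain and other subdomains.

```python
from urllib.parse import urlparse


def links_for_host(links, host):
    """Return the links whose host is `host` or one of its subdomains."""
    matched = []
    for link in links:
        # relative links have no host, so hostname is None for them
        hostname = urlparse(link).hostname or ""
        if hostname == host or hostname.endswith("." + host):
            matched.append(link)
    return matched


# Hypothetical scraped hrefs, including a relative link and two look-alikes.
sample_links = [
    "http://www.careerbuilder.com/jobs",
    "https://careerbuilder.com/about",
    "/relative/path",
    "http://www.businessinsider.com/careerbuilder.com",
    "http://notcareerbuilder.com/",
]
print(links_for_host(sample_links, "careerbuilder.com"))
```

Comparing the parsed host avoids matching a URL that merely mentions the domain somewhere in its path.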

code:
import urllib.request
import bs4

def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    # every <a class="title"> element on the page
    title_tags = soup.findAll('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]
    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)
out:
Where do you want to scrape from today?: http://www.businessinsider.com
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs

Related

How to get href's URL by using BeautifulSoup?

Could you tell me how to get the URL from href in the case below? I would appreciate it if you could show code that gets the URLs from href.
import requests as r
from bs4 import BeautifulSoup
url = r.get('https://jen.jiji.com/')
soup = BeautifulSoup(url.text, "html.parser")
for el in soup.find_all('li', attrs={ 'class' : 'arrow03'}):
    want = el.find_all('a')
    link = el.get('href', 'no URLs')
    print(want)
    print(link)
Currently, I can get the tags through print(want), but not the URLs through print(link).
[(Update) Japan Daily COVID-19 Cases Halve to 96,000]
no URLs
[2 Universities to Be Merged as "Institute of Science Tokyo"]
no URLs
[(Update) Kishida Cabinet Approval Hits Record-Low 26.5 Pct: Jiji Poll]
no URLs
[Japan Team Develops Black Sheet Absorbing 99.98 Pct of Visible Light]
no URLs
[Tokyo Confirms 7,719 New COVID-19 Cases]
no URLs
[Solaseed Air to Fly Pokemon Jet from March]
no URLs
[Coinbase Announces Halt of Japan Operations]
no URLs
[Japan 2022 Used Auto Sales Hit Record Low]
no URLs
[Osaka Gas Signs LNG Deal with New Sakhalin-2 Operator]
no URLs
[5 Japan Drug Wholesalers to Be Fined for Bid-Rigging]
no URLs
[Foreigners Turn Net Buyers of Japan Stocks Last Week]
no URLs
[Dollar Weaker around 128.20 Yen in Late Tokyo]
no URLs
[OSE Nikkei 225 Futures (Closing)]
no URLs
[Tokyo Stocks Slide on Yen's Bounce, Wall Street Drop]
no URLs
[Nikkei Average/TOPIX Index (Closing)]
no URLs
Please give me solutions or comments.
You are close to your goal, but el is the <li> element, which has no href of its own; you have to iterate over the ResultSet in want and read href from each <a>. Alternatively, I would recommend selecting the elements with CSS selectors:
for el in soup.select('li.arrow03 a'):
    link = 'https://jen.jiji.com'+el.get('href')
    print(link)
Because the links are relative, you have to prepend https://jen.jiji.com.
Output
https://jen.jiji.com/jc/eng?g=eco&k=2023011901044
https://jen.jiji.com/jc/eng?g=eco&k=2023011900863
https://jen.jiji.com/jc/eng?g=eco&k=2023011900878
https://jen.jiji.com/jc/eng?g=eco&k=2023011900770
https://jen.jiji.com/jc/eng?g=eco&k=2023011900850
https://jen.jiji.com/jc/eng?g=ind&k=2023011900956
https://jen.jiji.com/jc/eng?g=ind&k=2023011801016
https://jen.jiji.com/jc/eng?g=ind&k=2023011800979
https://jen.jiji.com/jc/eng?g=ind&k=2023011700996
https://jen.jiji.com/jc/eng?g=ind&k=2023011700892
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900934
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900924
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900766
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900758
https://jen.jiji.com/jc/eng?g=mkt&k=2023011900697
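Instead of concatenating the site prefix by hand, relative links can be resolved with urllib.parse.urljoin from the standard library, which also leaves already-absolute links untouched. A minimal sketch, assuming the page URL is https://jen.jiji.com/; the helper name to_absolute and the sample hrefs are hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "https://jen.jiji.com/"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against the page URL; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical hrefs as they might appear in the page source.
hrefs = [
    "/jc/eng?g=eco&k=2023011901044",
    "https://example.com/other",
]
for link in to_absolute(hrefs):
    print(link)
```

In the loop above, this would amount to calling urljoin with the page URL and el.get('href') for each selected element.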

need to extract link and text from the anchor tag using beautiful soup

I am trying to extract the link and the text from anchor tags using Beautiful Soup.
Below is the content from which I have to extract the data, that is, the link and the text of each anchor tag:
Mumbai: Vaccination figures surge in private hospitals, stagnate in government centres
Chennai: Martial arts instructor arrested following allegations of sexual assault
Mumbai Metro lines 2A and 7: Here is everything you need to know
**Python code to extract the content from the above code.**
@app.get('/indian_express', response_class=HTMLResponse)
async def dna_india(request: Request):
    print("1111111111111111")
    dict={}
    URL="https://indianexpress.com/latest-news/"
    page=requests.get(URL)
    soup=BS(page.content, 'html.parser')
    results=soup.find_all('div', class_="nation")
    for results_element in results:
        results_element_1 = soup.find_all('div', class_="title")
        for results_element_2 in results_element_1:
            for results_element_3 in results_element_2:
                print(results_element_3)  # the HTML shown above is printed by this line
                print(" ")
                link_element=results_element_3.find_all('a', class_="title", href=True)  # I get an empty [] when I try to print this
                # print(link_element)
                # title_elem = results_element_3.find('a')['href']
                # link_element=results_element_3.find('a').contents[0]
                # print(title_elem)
                # print(link_element)
                # for index,(title,link) in enumerate(zip(title_elem, link_element)):
                #     dict[str(title.text)]=str(link['href'])
    json_compatible_item_data = jsonable_encoder(dict)
    return templates.TemplateResponse("display.html", {"request":request, "json_data":json_compatible_item_data})

@app.get('/deccan_chronicle', response_class=HTMLResponse)
async def deccan_chronicle(request: Request):
    dict={}
    URL="https://www.news18.com/india/"
    page=requests.get(URL)
    soup=BS(page.content, 'html.parser')
    main_div = soup.find("div", class_="blog-list")
    for i in main_div:
        #link_data = i.find("div", class_="blog-list-blog").find("a")
        link_data=i.find("div",class_="blog-list-blog").find("a")
        text_data = link_data.text
        dict[str(text_data)] = str(link_data.attrs['href'])
    json_compatible_item_data = jsonable_encoder(dict)
    return templates.TemplateResponse("display.html", {"request":request, "json_data":json_compatible_item_data})
Please help me out with this code
You can find the main_div tag, which holds all the news records. Inside it, find the articles divs where the data is defined, then iterate over them: each title can be extracted by finding the a tag that contains both the title text and the href.
import requests
from bs4 import BeautifulSoup
res=requests.get("https://indianexpress.com/latest-news/")
soup=BeautifulSoup(res.text,"html.parser")
main_div=soup.find("div",class_="nation")
articles=main_div.find_all("div",class_="articles")
for i in articles:
    href=i.find("div",class_="title").find("a")
    print(href.attrs['href'])
    text_data=href.text
    print(text_data)
Output:
https://indianexpress.com/article/business/banking-and-finance/banks-cant-cite-2018-rbi-circular-to-caution-clients-on-virtual-currencies-7338628/
Banks can’t cite 2018 RBI circular to caution clients on virtual currencies
https://indianexpress.com/article/india/supreme-court-stays-delhi-high-court-order-on-levy-of-igst-on-imported-oxygen-concentrators-for-personal-use-7339478/
Supreme Court stays Delhi High Court order on levy of IGST on imported oxygen concentrators for personal use
...
2nd Method
Don't make it so complex. Just look at what data each tag contains: first find the main tag, main_div, then go for the tag that contains both the text and the link. Here that is the h4 tag, so iterate over those.
from bs4 import BeautifulSoup
import requests
res=requests.get("https://www.news18.com/india/")
soup=BeautifulSoup(res.text,"html.parser")
main_div = soup.find("div", class_="blog-list")
data=main_div.find_all("h4")
for i in data:
    print(i.find("a")['href'])
    print(i.find("a").text)
output:
https://www.news18.com/news/india/2-killed-six-injured-after-portion-of-two-storey-building-collapses-in-varanasi-pm-assures-help-3799610.html
2 Killed, Six Injured After Portion of Two-Storey Building Collapses in Varanasi; PM Assures Help
https://www.news18.com/news/india/dont-compel-citizens-to-move-courts-again-again-follow-national-litigation-policy-hc-tells-centre-3799598.html
Don't Compel Citizens to Move Courts Again & Again, Follow National Litigation Policy, HC Tells Centre
...

Generating URL for Yahoo news and Bing news with Python and BeautifulSoup

I want to scrape data from the Yahoo News and Bing News pages. The data I want are the headlines and/or the text below the headlines (whatever can be scraped) and the dates (times) when they were posted.
I have written some code, but it does not return anything. The problem is with my URL, since I am getting a 404 response.
Can you please help me with it?
This is the code for 'Bing'
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'http://www.bing.com/news/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
And this is for Yahoo:
term = 'usa'
url = 'http://news.search.yahoo.com/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Please help me generate these URLs. What is the logic behind them? I'm still a noob :)
Basically, your URLs are just wrong. The URLs you have to use are the same ones you find in the address bar when using a regular browser. Most search engines and aggregators use the q parameter for the search term. Most of the other parameters are usually not required (sometimes they are, e.g. for specifying the result page number).
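When the search term can contain spaces or other special characters, it is safer to URL-encode it than to paste it into the string with format(). A minimal sketch using urllib.parse.urlencode from the standard library; the helper name build_search_url is hypothetical, and the base URLs are the ones used in the Bing and Yahoo code in this answer:

```python
from urllib.parse import urlencode


def build_search_url(base, term):
    """Append the search term as a URL-encoded `q` query parameter."""
    return "{}?{}".format(base, urlencode({"q": term}))


print(build_search_url("https://www.bing.com/news/search", "usa election"))
print(build_search_url("https://news.search.yahoo.com/search", "usa"))
```

With requests, passing params={'q': term} to requests.get() performs the same encoding for you.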
Bing
from bs4 import BeautifulSoup
import requests
import re
term = 'usa'
url = 'https://www.bing.com/news/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for news_card in soup.find_all('div', class_="news-card-body"):
    title = news_card.find('a', class_="title").text
    time = news_card.find(
        'span',
        attrs={'aria-label': re.compile(".*ago$")}
    ).text
    print("{} ({})".format(title, time))
Output
Jason Mohammed blitzkrieg sinks USA (17h)
USA Swimming held not liable by California jury in sexual abuse case (1d)
United States 4-1 Canada: USA secure payback in Nations League (1d)
USA always plays the Dalai Lama card in dealing with China, says Chinese Professor (1d)
...
Yahoo
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'https://news.search.yahoo.com/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for news_item in soup.find_all('div', class_='NewsArticle'):
    title = news_item.find('h4').text
    time = news_item.find('span', class_='fc-2nd').text
    # Clean time text
    time = time.replace('·', '').strip()
    print("{} ({})".format(title, time))
Output
USA Baseball will return to Arizona for second Olympic qualifying chance (52 minutes ago)
Prized White Sox prospect Andrew Vaughn wraps up stint with USA Baseball (28 minutes ago)
Mexico defeats USA in extras for Olympic berth (13 hours ago)
...

BeautifulSoup Python

I'm scraping a news article using BeautifulSoup trying to only return the text body of the article itself, not all the additional "noise". Is there any easy way to do this?
import bs4
import requests
url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
element = soup.select_one('div.pg-rail-tall__body #body-text').text
print(element)
Trying to exclude some of the information returned such as
{CNN.VideoPlayer.handleUnmutePlayer = function
handleUnmutePlayer(containerId, dataObj) {'use strict';var
playerInstance,playerPropertyObj,rememberTime,unmuteCTA,unmuteIdSelector
= 'unmute_' +
The noise, as you call it, is the text in the <script>...</script> tags (JavaScript code). You can remove it using .extract() like:
for s in soup.find_all('script'):
    s.extract()
You can use this:
r = requests.get('https://edition.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html')
soup = BeautifulSoup(r.text, 'html.parser')
[x.extract() for x in soup.find_all('script')] # Does the same thing as the 'for-loop' above
element = soup.find('div', class_='pg-rail-tall__body')
print(element.text)
Partial Output:
(CNN)Puerto Rico Gov. Ricardo Rosselló announced Monday that the
commonwealth will begin privatizing the Puerto Rico Electric Power
Authority, or PREPA. In comments published on Twitter, the governor
said the assets sale would transform the island's power generation
system into a "modern" and "efficient" one that would be less
expensive for citizens.He said the system operates "deficiently" and
that the improved infrastructure would respond more "agilely" to
natural disasters. The privatization process will begin "in the next
few days" and occur in three phases over the next 18 months, the
governor said.JUST WATCHEDSchool cheers as power returns after 112
daysReplayMore Videos ...MUST WATCHSchool cheers as power returns
after 112 days 00:48San Juan Mayor Carmen Yulin Cruz, known for her
criticisms of the Trump administration's response to Puerto Rico after
Hurricane Maria, spoke out against the move.Cruz, writing on her
official Twitter account, said PREPA's privatization would put the
commonwealth's economic development into "private hands" and that the
power authority will begin to "serve other interests.
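The same idea, keeping text nodes while skipping everything inside script tags, can be illustrated without any third-party package. The following is only a sketch built on the standard library's html.parser, not BeautifulSoup; the class name VisibleTextParser, the helper visible_text, and the sample markup are hypothetical, and it additionally skips style tags:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the text of `html` with script/style content left out."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical fragment standing in for the article markup.
sample = (
    "<div><p>Power returns.</p>"
    "<script>var x = 'unmute_' + 1;</script>"
    "<p>Mayor responds.</p></div>"
)
print(visible_text(sample))
```

Joining the chunks with a space also avoids sentences running together, as in "citizens.He said" in the partial output above.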
Try this:
import bs4
import requests
url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-au$'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elementd = soup.findAll('div', {'class': 'zn-body__paragraph'})
elementp = soup.findAll('p', {'class': 'zn-body__paragraph'})
for i in elementp:
    print(i.text)
for i in elementd:
    print(i.text)

BeautifulSoup - return header correspondent to matched footer

I'm using Beautifulsoup to retrieve an artist name from a blog, given a specific match of music tags:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://musicblog.kms-saulgau.de/tag/chillout/')
html = r.content
soup = BeautifulSoup(html, 'html.parser')
Artist names are stored here:
header = soup.find_all('header', class_= "entry-header")
and artist tags here:
span = soup.find_all('span', class_= "tags-links")
I can get all headers:
for each in header:
    if each.find("a"):
        each = each.find("a").get_text()
        print(each)
Then I look for 'alternative' and 'chillout' in the same footer:
for each in span:
    if each.find("a"):
        tags = each.find("a")["href"]
        if "alternative" in tags:
            print(each.get_text())
The code, so far, prints:
Terra Nine – The Heart of the Matter
Emmit Fenn – Blinded
Amparo – The Orchid Glacier
Alpha Minus – Satellites
Carbonates on Mars – The Song of Sol
Josey Marina – Ocean Sighs
Sunday – Only
Some Kind Of Illness – The Light
Vesna Kazensky – Raven
James Lowe – Shallow
Tags Alternative, Chillout, Indie Rock, New tracks
What I'm trying to do is return only the entry corresponding to the matched footer, like so:
Some Kind Of Illness – The Light
Alternative, Chillout, Indie Rock, New tracks
How can I achieve that?
for article in soup.find_all('article'):
    if article.select('a[href*="alternative"]') and article.select('a[href*="chillout"]'):
        print(article.h2.text)
        print(article.find(class_='tags-links').text)
out:
Some Kind Of Illness – The Light
Tags Alternative, Chillout, Indie Rock, New tracks
