BeautifulSoup: Get text, create dictionary - python

I'm scraping information on central bank research publications. So far, for the Federal Reserve, I have the following Python code:
import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'

page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
for paper in soup.findAll("li", class_="list-group-item downfree"):
    print(paper.text)
This produces the following for the first of many publications:
2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model
Solutions: An Algorithm with Error Formulasby Gary S. Anderson
I now want to convert this into a Python dictionary, which will eventually contain a large number of papers:
Papers = {
    'Date': '2018-070',
    'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
    'Author/s': 'Gary S. Anderson'
}

I get good results by extracting all the descendants and picking only those that are NavigableStrings. Make sure to import NavigableString from bs4. I also use a list comprehension, but you could use for-loops as well.
import requests
from bs4 import BeautifulSoup, NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'

page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

papers = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    # keep only the bare text nodes: date, title, ' by ', author(s)
    info = [desc.strip() for desc in paper.descendants if type(desc) == NavigableString]
    # info[2] is the literal 'by' separator, so it is skipped
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})

print(papers[1])
{'Date': '2018-069',
'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}

You could use a regex to match each part of the string:
[-\d]+ matches the date, which contains only digits and -
(?<=\s).*?(?=by) matches the title, which starts after a blank and ends right before by (where the author begins)
(?<=by\s).* matches the author(s), the rest of the string
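For instance, applied to the first line from the question (assuming the wrap shown above is just display formatting), the three patterns pull out the three fields:

import re

text = ("2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model "
        "Solutions: An Algorithm with Error Formulasby Gary S. Anderson")

print(re.findall(r"[-\d]+", text)[0])            # '2018-070'
print(re.findall(r"(?<=\s).*?(?=by)", text)[0])  # 'Reliably Computing ... Error Formulas'
print(re.findall(r"(?<=by\s).*", text)[0])       # 'Gary S. Anderson'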
Full code
import requests
from bs4 import BeautifulSoup
import re

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'

page = requests.get(START_URL, verify=False)
soup = BeautifulSoup(page.text, 'html.parser')

datas = []
for paper in soup.findAll("li", class_="list-group-item downfree"):
    data = dict()
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0]
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]
    print(data)
    datas.append(data)
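One caveat: verify=False disables TLS certificate verification, and requests will emit an InsecureRequestWarning on every call. If you do keep it, you can silence the warning via urllib3 (which requests depends on):

import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)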


Need to extract link and text from the anchor tag using Beautiful Soup

I am working on extracting the link and text from anchor tags using Beautiful Soup.
Below is the content I have to extract the data from; each line is the text of an anchor tag whose link I also need:
Mumbai: Vaccination figures surge in private hospitals, stagnate in government centres
Chennai: Martial arts instructor arrested following allegations of sexual assault
Mumbai Metro lines 2A and 7: Here is everything you need to know
**Python code to extract the content from the above:**
@app.get('/indian_express', response_class=HTMLResponse)
async def dna_india(request: Request):
    print("1111111111111111")
    dict = {}
    URL = "https://indianexpress.com/latest-news/"
    page = requests.get(URL)
    soup = BS(page.content, 'html.parser')
    results = soup.find_all('div', class_="nation")
    for results_element in results:
        results_element_1 = soup.find_all('div', class_="title")
        for results_element_2 in results_element_1:
            for results_element_3 in results_element_2:
                print(results_element_3)  # the html code above gets printed because of this print
                print(" ")
                link_element = results_element_3.find_all('a', class_="title", href=True)  # I am getting an empty [] when I try to print here
                # print(link_element)
                # title_elem = results_element_3.find('a')['href']
                # link_element = results_element_3.find('a').contents[0]
                # print(title_elem)
                # print(link_element)
                # for index, (title, link) in enumerate(zip(title_elem, link_element)):
                #     dict[str(title.text)] = str(link['href'])
    json_compatible_item_data = jsonable_encoder(dict)
    return templates.TemplateResponse("display.html", {"request": request, "json_data": json_compatible_item_data})
@app.get('/deccan_chronicle', response_class=HTMLResponse)
async def deccan_chronicle(request: Request):
    dict = {}
    URL = "https://www.news18.com/india/"
    page = requests.get(URL)
    soup = BS(page.content, 'html.parser')
    main_div = soup.find("div", class_="blog-list")
    for i in main_div:
        link_data = i.find("div", class_="blog-list-blog").find("a")
        text_data = link_data.text
        dict[str(text_data)] = str(link_data.attrs['href'])
    json_compatible_item_data = jsonable_encoder(dict)
    return templates.TemplateResponse("display.html", {"request": request, "json_data": json_compatible_item_data})
Please help me out with this code
You can find the main_div tag which holds all the news records, then find the articles where the data is defined. Iterating over those articles, the title can be extracted by finding the proper a tag, which contains the title as well as the href.
import requests
from bs4 import BeautifulSoup

res = requests.get("https://indianexpress.com/latest-news/")
soup = BeautifulSoup(res.text, "html.parser")
main_div = soup.find("div", class_="nation")
articles = main_div.find_all("div", class_="articles")
for i in articles:
    href = i.find("div", class_="title").find("a")
    print(href.attrs['href'])
    text_data = href.text
    print(text_data)
Output:
https://indianexpress.com/article/business/banking-and-finance/banks-cant-cite-2018-rbi-circular-to-caution-clients-on-virtual-currencies-7338628/
Banks can’t cite 2018 RBI circular to caution clients on virtual currencies
https://indianexpress.com/article/india/supreme-court-stays-delhi-high-court-order-on-levy-of-igst-on-imported-oxygen-concentrators-for-personal-use-7339478/
Supreme Court stays Delhi High Court order on levy of IGST on imported oxygen concentrators for personal use
...
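Since your endpoint passes a {title: link} dictionary to jsonable_encoder, you can collect the pairs instead of printing them; a minimal sketch along the lines of your deccan_chronicle handler:

news = {}
for i in articles:
    href = i.find("div", class_="title").find("a")
    news[href.text.strip()] = href.attrs['href']
# news can now be passed to jsonable_encoder(...)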
2nd Method
Don't make it so complex; just observe which tags contain the data. I found the main tag main_div, then looked for a tag containing the text as well as the links; you can find both in the h4 tags and iterate over them:
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.news18.com/india/")
soup = BeautifulSoup(res.text, "html.parser")
main_div = soup.find("div", class_="blog-list")
data = main_div.find_all("h4")
for i in data:
    print(i.find("a")['href'])
    print(i.find("a").text)
Output:
https://www.news18.com/news/india/2-killed-six-injured-after-portion-of-two-storey-building-collapses-in-varanasi-pm-assures-help-3799610.html
2 Killed, Six Injured After Portion of Two-Storey Building Collapses in Varanasi; PM Assures Help
https://www.news18.com/news/india/dont-compel-citizens-to-move-courts-again-again-follow-national-litigation-policy-hc-tells-centre-3799598.html
Don't Compel Citizens to Move Courts Again & Again, Follow National Litigation Policy, HC Tells Centre
...

BeautifulSoup scrape the first title tag in each <li>

I have some code that goes through the cast list of a show or movie on Wikipedia, scraping all the actors' names and storing them. The current code finds all the <a> tags in the list and stores their title attributes. It currently goes:
import requests
from bs4 import BeautifulSoup

URL = input()
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').find_all('a'):
    title = x.get('title')
    print(title)
    if title is not None:
        Stars.append(title)
    else:
        continue
While this partially works, there are two downsides:
It doesn't work if the actor doesn't have a Wikipedia page hyperlink.
It also scrapes any other hyperlink title it finds. e.g. https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull returns ['Harrison Ford', 'Indiana Jones (character)', 'Bullwhip', 'Cate Blanchett', 'Irina Spalko', 'Bob cut', 'Rosa Klebb', 'From Russia with Love (film)', 'Karen Allen', 'Marion Ravenwood', 'Ray Winstone', 'Sallah', 'List of characters in the Indiana Jones series', 'Sexy Beast', 'Hamstring', 'Double agent', 'John Hurt', 'Ben Gunn (Treasure Island)', 'Treasure Island', 'Courier', 'Jim Broadbent', 'Marcus Brody', 'Denholm Elliott', 'Shia LaBeouf', 'List of Indiana Jones characters', 'The Young Indiana Jones Chronicles', 'Frank Darabont', 'The Lost World: Jurassic Park', 'Jeff Nathanson', 'Marlon Brando', 'The Wild One', 'Holes (film)', 'Blackboard Jungle', 'Rebel Without a Cause', 'Switchblade', 'American Graffiti', 'Rotator cuff']
Is there a way I can get BeautifulSoup to scrape only the first two words after each <li>? Or is there a better solution for what I am trying to do?
You can use CSS selectors to grab only the first <a> in a <li>:
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
Example
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull#Cast'
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
    Stars.append(x.get('title'))
Stars
Output
['Harrison Ford',
'Cate Blanchett',
'Karen Allen',
'Ray Winstone',
'John Hurt',
'Jim Broadbent',
'Shia LaBeouf']
You can use a regex to fetch the names from the text content of each <li> and just take the first two words; this also fixes the issue where an actor doesn't have a Wikipedia page hyperlink.
import re
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)", <text_content_from_li>)
Example:
text = "Cate Blanchett as Irina Spalko, a villainous Soviet agent. Screenwriter David Koepp created the character."
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)",text)
Output:
[('Cate', 'Blanchett'), ('Irina', 'Spalko'), ('Screenwriter', 'David')]
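As the last tuple shows, the pattern can also pick up capitalized non-names ('Screenwriter David'). If the cast items follow the usual "Actor as Character" pattern, splitting the plain text of each <li> is another option that works with or without a hyperlink; a minimal sketch, assuming that structure and reusing section from the example above:

for li in section.find_next('ul').find_all('li'):
    # everything before the first ' as ' is the actor's name
    actor = li.get_text().split(' as ')[0].strip()
    print(actor)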
There is considerable variation in the HTML for cast lists within the film listings on Wikipedia. Perhaps look to an API to get this info?
E.g. imdb8 allows for a reasonable number of calls, which you could use with the following endpoint:
https://imdb8.p.rapidapi.com/title/get-top-cast
There also seems to be a Python IMDb API.
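For instance, a quick sketch with the IMDbPY package (now published as cinemagoer; pip install cinemagoer), using the first movie ID from the listing below without the 'tt' prefix:

from imdb import IMDb

ia = IMDb()
movie = ia.get_movie('0367882')
for person in movie['cast'][:7]:
    print(person['name'])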
Or choose something with more regular HTML. For example, if you take the IMDb film IDs in a list, you can extract the full cast and the main actors from IMDb as follows. To get the shorter cast list, I filter out the rows which occur at/after the text "Rest" within "Rest of cast listed alphabetically:".
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

movie_ids = ['tt0367882', 'tt7126948']
base = 'https://www.imdb.com'

with requests.Session() as s:
    for movie_id in movie_ids:
        link = f'https://www.imdb.com/title/{movie_id}/fullcredits?ref_=tt_cl_sm'
        # print(link)
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
        full_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list [href*=name]:has(img)')]
        main_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list tr:not(:has(.castlist_label:contains(cast)) ~ tr, :has(.castlist_label:contains(cast))) [href*=name]:has(img)')]
        df_full = pd.DataFrame(full_cast, columns=['Actor', 'Link'])
        df_main = pd.DataFrame(main_cast, columns=['Actor', 'Link'])
        # print(df_full)
        print(df_main)

How to extract player names using Python with BeautifulSoup from cricinfo

I'm learning Beautiful Soup. I want to extract the player names, i.e. the playing eleven for both teams, from cricinfo.com. The exact link is "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
The problem is that the website only displays the players under class "wrap batsmen" if they have batted. Otherwise they are placed under the class "wrap dnb". I want to extract all the players irrespective of whether they have batted or not. How can I maintain two arrays (one for each team) that will dynamically search for players in "wrap batsmen" and "wrap dnb" (if required)?
This is my attempt:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

years = []
# Years we will be analyzing
for i in range(2010, 2018):
    years.append(i)

names = []
# URL of the page we will be scraping (see image above)
url = "https://www.espncricinfo.com/series/13266/scorecard/439146/west-indies-vs-south-africa-1st-t20i-south-africa-tour-of-west-indies-2010"
# this is the HTML from the given URL
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")
for a in range(0, 1):
    names.append([a.getText() for a in soup.find_all("div", class_="cell batsmen")[1:][a].findAll('a', limit=1)])
soup = soup.find_all("div", class_="wrap dnb")
print(soup[0])
While this is possible with BeautifulSoup, it's not the best tool for the job. All that data (and much more) is available through the API; simply pull that and then parse the JSON to get what you want (and more).
You can find the API URL by using dev tools (Ctrl-Shift-I) and watching what requests the browser makes (look at Network -> XHR in the side panel; you may need to click around for it to make the request/call). Here's a quick script, though, to get the 11 players for each team:
import requests

url = 'https://site.web.api.espn.com/apis/site/v2/sports/cricket/13266/summary'
payload = {
    'contentorigin': 'espn',
    'event': '439146',
    'lang': 'en',
    'region': 'gb',
    'section': 'cricinfo'}

jsonData = requests.get(url, params=payload).json()
roster = jsonData['rosters']

players = {}
for team in roster:
    players[team['team']['displayName']] = []
    for player in team['roster']:
        playerName = player['athlete']['displayName']
        players[team['team']['displayName']].append(playerName)
Output:
print(players)
{'West Indies': ['Chris Gayle', 'Andre Fletcher', 'Dwayne Bravo', 'Ramnaresh Sarwan', 'Narsingh Deonarine', 'Kieron Pollard', 'Darren Sammy', 'Nikita Miller', 'Jerome Taylor', 'Sulieman Benn', 'Kemar Roach'], 'South Africa': ['Graeme Smith', 'Loots Bosman', 'Jacques Kallis', 'AB de Villiers', 'Jean-Paul Duminy', 'Johan Botha', 'Alviro Petersen', 'Ryan McLaren', 'Roelof van der Merwe', 'Dale Steyn', 'Charl Langeveldt']}

BeautifulSoup Python

I'm scraping a news article using BeautifulSoup, trying to return only the text body of the article itself, not all the additional "noise". Is there an easy way to do this?
import bs4
import requests
url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
element = soup.select_one('div.pg-rail-tall__body #body-text').text
print(element)
I am trying to exclude some of the information returned, such as:
{CNN.VideoPlayer.handleUnmutePlayer = function
handleUnmutePlayer(containerId, dataObj) {'use strict';var
playerInstance,playerPropertyObj,rememberTime,unmuteCTA,unmuteIdSelector
= 'unmute_' +
The noise, as you call it, is the text in the <script>...</script> tags (JavaScript code). You can remove it using .extract() like:
for s in soup.find_all('script'):
    s.extract()
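The same trick also applies to <style> tags; .decompose() is an equivalent alternative to .extract() when you don't need the removed tag afterwards. A small sketch covering both:

for s in soup.find_all(['script', 'style']):
    s.decompose()  # removes the tag and its contents from the tree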
You can use this:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://edition.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html')
soup = BeautifulSoup(r.text, 'html.parser')
[x.extract() for x in soup.find_all('script')]  # does the same thing as the for-loop above
element = soup.find('div', class_='pg-rail-tall__body')
print(element.text)
Partial Output:
(CNN)Puerto Rico Gov. Ricardo Rosselló announced Monday that the
commonwealth will begin privatizing the Puerto Rico Electric Power
Authority, or PREPA. In comments published on Twitter, the governor
said the assets sale would transform the island's power generation
system into a "modern" and "efficient" one that would be less
expensive for citizens.He said the system operates "deficiently" and
that the improved infrastructure would respond more "agilely" to
natural disasters. The privatization process will begin "in the next
few days" and occur in three phases over the next 18 months, the
governor said.JUST WATCHEDSchool cheers as power returns after 112
daysReplayMore Videos ...MUST WATCHSchool cheers as power returns
after 112 days 00:48San Juan Mayor Carmen Yulin Cruz, known for her
criticisms of the Trump administration's response to Puerto Rico after
Hurricane Maria, spoke out against the move.Cruz, writing on her
official Twitter account, said PREPA's privatization would put the
commonwealth's economic development into "private hands" and that the
power authority will begin to "serve other interests.
Try this:
import bs4
import requests

url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-au$'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elementd = soup.findAll('div', {'class': 'zn-body__paragraph'})
elementp = soup.findAll('p', {'class': 'zn-body__paragraph'})
for i in elementp:
    print(i.text)
for i in elementd:
    print(i.text)
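Since both findAll calls target the same class, a single CSS selector can replace them, matching the class regardless of tag name; note this yields the paragraphs in document order rather than all <p> before all <div>:

for el in soup.select('.zn-body__paragraph'):
    print(el.text)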

Scrapy or BeautifulSoup to scrape links and text from various websites

I am trying to scrape the links from an inputted URL, but it only works for one URL (http://www.businessinsider.com). How can it be adapted to scrape from any inputted URL? I am using BeautifulSoup, but is Scrapy better suited for this?
import urllib.request
from bs4 import BeautifulSoup

def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title = soup.findAll('a', {'class': 'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href'] + "," + eachtitle.string)
    else:
        print('Please enter another Website...')
You could make a more generic scraper, searching for all tags and all links within those tags. Once you have the list of all links, you can use a regular expression or similar to find the links that match your desired structure.
import requests
from bs4 import BeautifulSoup
import re

response = requests.get('http://www.businessinsider.com')
soup = BeautifulSoup(response.content, 'html.parser')

# find all tags
tags = soup.find_all()

links = []
# iterate over all tags and extract links
for tag in tags:
    # find all href links
    tmp = tag.find_all(href=True)
    # append each link to the master links list
    links.extend(x['href'] for x in tmp if x['href'])

# example: filter only careerbuilder links
career_links = [x for x in links if re.search(r'[w]{3}\.careerbuilder\.com', x)]
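One thing to watch for: many href values are relative paths. urllib.parse.urljoin can resolve them against the page URL before you filter; a small sketch using the same links list:

from urllib.parse import urljoin

base_url = 'http://www.businessinsider.com'
absolute_links = [urljoin(base_url, link) for link in links]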
Code:
import urllib.request
import bs4

def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    title_tags = soup.findAll('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]
    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)
Out:
Where do you want to scrape from today?: http://www.businessinsider.com
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs
