Python: How to extract URLs from an HTML page using BeautifulSoup?

I have an HTML page with multiple divs like:
<div class="article-additional-info">
    A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t...
    <a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece">
        <span class="arrows">»</span>
    </a>
</div>
<div class="article-additional-info">
    Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe...
    <a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece">
    <a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments">
</div>
and I need to get the <a href=...> value for all the divs with class article-additional-info.
I am new to BeautifulSoup, so I need the URLs
"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"
What is the best way to achieve this?

According to your criteria, it returns three URLs (not two) - did you want to filter out the third?
The basic idea is to iterate through the HTML, pulling out only the elements with your class, and then iterating through all of the links inside each one, pulling out the actual URLs:
In [1]: from bs4 import BeautifulSoup
In [2]: html = # your HTML
In [3]: soup = BeautifulSoup(html)
In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print link.get('href')
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments
This limits the search to just the elements with the article-additional-info class, and within each of those it looks for all anchor (a) tags and grabs the corresponding href link.
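Since three URLs come back (the commentsCount anchor produces the extra one), here is a small variant of the same loop that keeps only the main links; a sketch, assuming the main links always carry the more class, as they do in the sample HTML:
for item in soup.find_all(attrs={'class': 'article-additional-info'}):
    for link in item.find_all('a', class_='more'):
        print(link.get('href'))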

After working through the documentation, I did it the following way. Thank you all for your answers, I appreciate them.
>>> import urllib2
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f.fp)
>>> for link in soup.select('.article-additional-info'):
...     print link.find('a').attrs['href']
...
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article4323210.ece
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece
>>>
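For reference, a rough Python 3 equivalent of the above (a sketch: urllib2 only exists on Python 2, so this assumes the requests package instead):
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.thehindu.com/news/cities/delhi/?union=citynews')
soup = BeautifulSoup(r.text, 'html.parser')
for div in soup.select('.article-additional-info'):
    link = div.find('a')  # first anchor only, which skips the commentsCount links
    if link is not None:
        print(link.attrs['href'])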

from bs4 import BeautifulSoup as BS
html = # Your HTML
soup = BS(html)
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print links.get('href')
Which prints:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments



BeautifulSoup: Extracting text from nested tags

Long time lurker, first time poster. I spent some time looking over related questions but I still couldn't seem to figure this out. I think it's easy enough but please forgive me, I'm still a bit of a BeautifulSoup/python n00b.
I have a text file of URLs I parsed from a previous webscraping exercise that I'd like to search through and extract the text contents of a list item (<li>) based on a given keyword. I want to save a csv file with the URL in one column and the corresponding contents from the list item in the second column. In this case, it's albums that I'd like to build a table of: who mastered the album, who produced the album, etc.
Given a snippet of html:
from https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
...
<li>
    <span class="entity_1XpR8">Recorded By</span>
    " – "
    EMI Studios, Paris
</li>
<li>
    <span class="entity_1XpR8">Mastered At</span>
    " – "
    Sterling Sound
</li>
etc etc etc
...
My code so far is something like:
import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []
kw = "Mastered At"
with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        x = soup.find_all('span', string='Mastered At')
        results.append((url, x))
print(results)
df = pd.DataFrame(results)
df.to_csv('mylist1.csv')
With some modifications based on comments below, still having issues:
As you can see I'm trying to do this within a for loop for each link in a list.
The URL list is a simple text file with one URL per line. Since I'm scraping only one website, the sources, class names, etc. should be the same, but the item will change from page to page.
ex URL list:
https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
https://www.discogs.com/release/3872976-Pink-Floyd-The-Wall
... etc etc etc
updated code snippet:
import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []
with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        print(url)
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Mastered At']:
            results.append((x.select_one('a.linkClass').get('href'),
                            x.select_one('a.linkClass').text.strip(),
                            x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(results, columns=['Url', 'Mastered At', 'Studio'])
print(df)
df.to_csv('studios.csv')
I'm hoping the output in this case is Col 1: (url from the txt file); Col 2: "Mastered At — Sterling Sound" (or just "Sterling Sound"), for each page in the list, since these items vary from page to page. I will change the keyword to extract different list items accordingly. In the end I'd like one big spreadsheet with the URL and the corresponding item side by side, something like below:
example:
album url | Sterling Sound
album url | Abbey Road
album url | Abbey Road
album url | Sterling Sound
album url | Real World Studios
album url | EMI Studios, Paris
album url | Sterling Sound
etc etc etc
Thanks for your help!
Cheers.
The Beautiful Soup library is best suited for this task.
You can use the following code to extract data:
from bs4 import BeautifulSoup  # the lxml package must be installed for the 'lxml' parser

# urls.html would be better
with open("urls.txt") as file:
    src = file.read()

soup = BeautifulSoup(src, 'lxml')
for first, second in zip(soup.select("li span"), soup.select("li a")):
    print(first)
    print(second)
To find the desired elements, you can use the bs4 select() method. It accepts a CSS selector to search for and returns a list of all matching HTML elements.
In this case, I use the zip() built-in function, which lets you walk through two sequences together in a single loop.
Then you can use the data for your tasks.
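For example, a small sketch that turns the zipped pairs into usable data rather than printing raw tags (this assumes each li contributes exactly one span and one a):
pairs = [(span.get_text(strip=True), a.get_text(strip=True))
         for span, a in zip(soup.select("li span"), soup.select("li a"))]
print(pairs)  # e.g. [('Mastered At', 'Sterling Sound'), ...]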
BeautifulSoup can use different parsers for HTML. If you have issues with lxml, you can try others, like html.parser. You can try the following code; it will create a dataframe from your data, which can then be saved to csv or other formats:
from bs4 import BeautifulSoup
import pandas as pd

html = '''
<li>
    <span class="spClass">Breakfast</span> " — "
    <a class="linkClass" href="/examplepage/Pancakes">Pancakes</a>
</li>
<li>
    <span class="spClass">Lunch</span> " — "
    <a class="linkClass" href="/examplepage/Sandwiches">Sandwiches</a>
</li>
<li>
    <span class="spClass">Dinner</span> " — "
    <a class="linkClass" href="/examplepage/Stew">Stew</a>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')
df_list = []
for x in soup.select('li'):
    df_list.append((x.select_one('a.linkClass').get('href'),
                    x.select_one('a.linkClass').text.strip(),
                    x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
print(df)  # you can save the dataframe as csv like so: df.to_csv('foods.csv')
Result:
                       Url        Food       Type
0    /examplepage/Pancakes    Pancakes  Breakfast
1  /examplepage/Sandwiches  Sandwiches      Lunch
2        /examplepage/Stew        Stew     Dinner
EDIT: If you only want to extract specific li tags, as per your comment, you can do:
soup = BeautifulSoup(html, 'html.parser')
df_list = []
for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Dinner']:
    df_list.append((x.select_one('a.linkClass').get('href'),
                    x.select_one('a.linkClass').text.strip(),
                    x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
And this will return:
                 Url  Food    Type
0  /examplepage/Stew  Stew  Dinner

BeautifulSoup how to use for loops and extract specific data?

The HTML code below is from a website regarding movie reviews. I want to extract the Stars from the code below, which would be John C. Reilly, Sarah Silverman and Gal Gadot. How could I do this?
Code:
html_doc = """
<html>
<head>
</head>
<body>
<div class="credit_summary_item">
<h4 class="inline">Stars:</h4>
John C. Reilly,
Sarah Silverman,
Gal Gadot
<span class="ghost">|</span>
See full cast & crew »
</div>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
My idea
I was going to use for loops to iterate through each div class until I found the class with the text Stars, from which I could then extract the names. But I don't know how I would code this, as I am not too familiar with HTML syntax nor the module.
You can iterate over all the a tags in the credit_summary_item div; the starred assignment drops the final element, which is the "See full cast & crew" link:
from bs4 import BeautifulSoup as soup
*results, _ = [i.text for i in soup(html_doc, 'html.parser').find('div', {'class':'credit_summary_item'}).find_all('a')]
Output:
['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
Edit:
_d = [i for i in soup(html_doc, 'html.parser').find_all('div', {'class':'credit_summary_item'}) if 'Stars:' in i.text][0]
*results, _ = [i.text for i in _d.find_all('a')]
Output:
['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
I will show how to implement this; you will see that you only need to learn BeautifulSoup syntax.
First, we use the findAll method to get every "div" tag whose "class" attribute matches:
divs = soup.findAll("div", attrs={"class": "credit_summary_item"})
Then we filter out all the divs without "Stars:" in them:
stars = [div for div in divs if "Stars:" in div.h4.text]
If only one div matches, you can take it out:
star = stars[0]
Then, again, find all the text in the "a" tags:
names = [a.text for a in star.findAll("a")]
You can see that I didn't use any HTML/CSS selector syntax, only soup.
I hope it helped.
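For convenience, here are the steps above assembled into one runnable snippet (a sketch; the last anchor in the div is the "See full cast & crew" link, so it is sliced off):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
divs = soup.findAll("div", attrs={"class": "credit_summary_item"})
stars = [div for div in divs if "Stars:" in div.h4.text]
star = stars[0]
names = [a.text for a in star.findAll("a")]
print(names[:-1])  # ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']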
You can also use regex
import re

stars = soup.findAll('a', href=re.compile('/name/nm.+'))
names = [x.text for x in stars]
names
# output: ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']

BeautifulSoup extract text from comment html [duplicate]

This question already has answers here:
How to find all comments with Beautiful Soup
(2 answers)
Closed 4 years ago.
Apologies if this question is similar to others; I wasn't able to make any of the other solutions work. I'm scraping a website using BeautifulSoup and I am trying to get the information from a table field that's commented out:
<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>
How do I get the part 'views' and 'interaction'?
You need to extract the HTML from the comment and parse it again with BeautifulSoup like this:
from bs4 import BeautifulSoup, Comment

html = """<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>"""

soup = BeautifulSoup(html, 'lxml')
comment = soup.find(text=lambda text: isinstance(text, Comment))
commentsoup = BeautifulSoup(comment, 'lxml')
views = commentsoup.find('span', {'class': 'views'})
interaction = commentsoup.find('span', {'class': 'interaction'})
print(views.get_text(), interaction['likes'])
Outputs:
1.56M Clicks 0
If the comment is not the first on the page you would need to index it like this:
comment = soup.find_all(text=lambda text:isinstance(text, Comment))[1]
or find it from a parent element.
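A sketch of the "from a parent element" route (assuming you can locate the enclosing td first, for example by a more specific lookup than the bare tag name used here):
td = soup.find('td')  # or a more specific lookup
comment = td.find(text=lambda text: isinstance(text, Comment))
commentsoup = BeautifulSoup(comment, 'lxml')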
Updated in response to comment:
You can use the parent 'tr' element for this. The page you supplied had "shares", not "interaction", so I expect you got a NoneType object, which gave you the error you saw. You could add tests in your code for NoneType objects if you need to.
from bs4 import BeautifulSoup, Comment
import requests

url = "https://imvdb.com/calendar/2018?page=1"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
for tr in soup.find_all('tr'):
    comment = tr.find(text=lambda text: isinstance(text, Comment))
    commentsoup = BeautifulSoup(comment, 'lxml')
    views = commentsoup.find('span', {'class': 'views'})
    shares = commentsoup.find('span', {'class': 'shares'})
    print(views.get_text(), shares['data-shares'])
Outputs:
3.60K Views 0
1.56M Views 0
220.28K Views 0
6.09M Views 0
133.04K Views 0
163.62M Views 0
30.44K Views 0
2.95M Views 0
2.10M Views 0
83.21K Views 0
5.27K Views 0
...
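Following the note above about NoneType tests, a guarded variant of the same loop (a sketch) that skips rows without a comment and fields that are missing instead of raising:
for tr in soup.find_all('tr'):
    comment = tr.find(text=lambda text: isinstance(text, Comment))
    if comment is None:
        continue  # row without a commented-out block
    commentsoup = BeautifulSoup(comment, 'lxml')
    views = commentsoup.find('span', {'class': 'views'})
    shares = commentsoup.find('span', {'class': 'shares'})
    if views is not None and shares is not None:
        print(views.get_text(), shares['data-shares'])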
The simplest and easiest solution would be to opt for the .replace() function. All you need to do is strip the <!-- and --> markers out of the HTML; the rest stays as it is. Take a look at the script below.
from bs4 import BeautifulSoup

htdoc = """
<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>
"""

elem = htdoc.replace("<!--", "").replace("-->", "")
soup = BeautifulSoup(elem, 'lxml')
views = soup.select_one('span.views').get_text(strip=True)
likes = soup.select_one('span.interaction')['likes']
print(f'{views}\n{likes}')
Output:
1.56M Clicks
0
If you want only the views then:
views = soup.findAll("span", {"class": "views"})
You can also get the whole paragraph with:
p = soup.findAll("p", {"class": "statistics"})
Then you can get the data from the p.
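For instance, a short sketch (assuming soup was built from the uncommented HTML, as in the script above):
p = soup.findAll("p", {"class": "statistics"})[0]
views = p.find("span", {"class": "views"}).get_text(strip=True)
likes = p.find("span", {"class": "interaction"})["likes"]
print(views, likes)  # 1.56M Clicks 0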

Beautiful Soup - For loop not iterating through all tags within td

I am trying to scrape some data from a website using BeautifulSoup. I can select the td tag, but it does not contain all the child tags I would expect. My goal is to iterate through the td tag that has the id="highlight_today" and retrieve all of today's events. The url I'm attempting to scrape is http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container. This is an iframe within another page, http://www.bloomberg.com/markets/economic-calendar. I think that another iframe may be the reason my for loop is not working and I'm not retrieving all the tags I would expect within this td. My html experience is very limited so I'm not sure. My code is as follows:
import requests
from bs4 import BeautifulSoup

url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
r = requests.get(url_to_scrape)
html = r.content
soup = BeautifulSoup(html, "html.parser")
for events in soup.find('td', {'id': "highlight_today"}):
    print(events.text)
I am expecting to retrieve all the tags contained within the td, but it ends up stopping on this item in the html code and doesn't proceed to the next div in the td:
<span class="econoarticles">Daniel Tarullo Speaks<br></span>
There may be a better way to accomplish this than my current code. I'm still pretty amateurish at this, so open to any and all suggestions on how to accomplish my goal.
soup.find() returns one tag. Perhaps you meant to use find_all()?
Also, why are you expecting to find more than one element with a given ID? HTML IDs are (supposed to be) unique across the whole document.
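To make the failure mode concrete, a small sketch (assuming the soup from the question): iterating over the single Tag that find() returns walks its direct children, not a list of matches:
td = soup.find('td', {'id': 'highlight_today'})
for child in td:           # what the question's loop does: direct children only
    print(child)
for tag in td.find_all():  # every descendant tag instead
    print(tag)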
Try this code. It uses the Selenium web driver.
from selenium import webdriver
import time

driver = webdriver.Firefox()  # optional argument, if not specified will search path
driver.get("http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container")
time.sleep(5)
table = driver.find_element_by_xpath("html/body/table/tbody/tr[4]/td/table[1]/tbody/tr/td/table[2]/tbody/tr[2]")
# table = driver.find_element_by_class_name('eventstable')
columns = table.find_elements_by_tag_name('td')
time.sleep(1)

# option 1: get the whole column
for col in columns:
    print(col.text)

# option 2: get info row by row, but the information is hidden in different classes
for col in columns:
    rows = col.find_elements_by_tag_name('div')
    for row in rows:
        print(row.text)
    rows = col.find_elements_by_tag_name('span')
    for row in rows:
        print(row.text)
The result for the last news column will be:
Market Focus »
Daniel Tarullo Speaks
10:15 AM ET
Baker-Hughes Rig Count
1:00 PM ET
John Williams Speaks
2:30 PM ET
You can parse these strings. Use the different class names to search for the info you need.
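For example, a sketch that pulls just the event names by class instead of dumping raw text (this assumes the same old find_elements_by_* Selenium API as above, and the econoarticles class seen in the page's HTML):
for col in columns:
    for item in col.find_elements_by_class_name('econoarticles'):
        print(item.text)  # e.g. Daniel Tarullo Speaks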
There is one td with the id highlight_today; all the children are contained in that tag, so you just pull it. If you want to iterate over the children, you call find_all():
import requests
from bs4 import BeautifulSoup
url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
r = requests.get(url_to_scrape)
html = r.content
soup = BeautifulSoup(html, "html.parser")
event = soup.find('td', {'id': "highlight_today"})
for tag in event.find_all():
    print(tag)
Which would give you:
<div class="econoitems"><br/><span class="econoitems">Market Focus <span class="econo-item-arrow">»</span></span><br/></div>
<br/>
<span class="econoitems">Market Focus <span class="econo-item-arrow">»</span></span>
Market Focus <span class="econo-item-arrow">»</span>
<span class="econo-item-arrow">»</span>
<br/>
<br/>
<div class="itembreak"></div>
<br/>
<span class="econoarticles">Daniel Tarullo Speaks<br/></span>
Daniel Tarullo Speaks<br/>
<br/>
The html is actually broken, so you will need either lxml or html5lib to parse it. Then, to get what you want, you need to find the spans with the econoarticles class and do a little bit of extra work to get the times:
url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
r = requests.get(url_to_scrape)
html = r.content
soup = BeautifulSoup(html, "lxml")
event = soup.find('td', {'id': "highlight_today"})
for span in event.select("span.econoarticles"):
    speaker, time, a = span.text, span.find_next_sibling(text=True), span.a["href"]
    print(speaker, time, a)
Which, if we run it, gives you:
In [2]: import requests
   ...: from bs4 import BeautifulSoup
   ...: url_to_scrape = 'http://b-us.econoday.com/byweek.asp?containerId=eco-iframe-container'
   ...: r = requests.get(url_to_scrape)
   ...: html = r.content
   ...: soup = BeautifulSoup(html, "lxml")
   ...: event = soup.find('td', {'id': "highlight_today"})
   ...: for span in event.select("span.econoarticles"):
   ...:     speaker, time, a = span.text, span.find_next_sibling(text=True), span.a["href"]
   ...:     print(speaker, time, a)
   ...:
Daniel Tarullo Speaks 10:15 AM ET byshoweventfull.asp?fid=476382&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top
John Williams Speaks 2:30 PM ET byshoweventfull.asp?fid=476390&cust=b-us&year=2016&lid=0&containerId=eco-iframe-container&prev=/byweek.asp#top
In [3]:
If you want the Market Focus entry and its URL as well, just add:
event = soup.find('td', {'id': "highlight_today"})
det = event.select_one("span.econoitems")
name, an = det.text, det.a["href"]
print(name, an)

Pulling specific (text) spaced between HTML tags using BeautifulSoup

I'm trying to pull something that is categorized as (text) when I look at it in "Inspect Element" mode:
<div class="sammy"
<div class = "sammyListing">
<a href="/Chicago_Magazine/blahblahblah">
<b>BLT</b>
<br>
"
Old Oak Tap" <---**THIS IS THE TEXT I WANT**
<br>
<em>Read more</em>
</a>
</div>
</div>
This is my code thus far, with the line in question being the bottom list comprehension at the end:
from urllib2 import urlopen
from bs4 import BeautifulSoup

STEM_URL = 'http://www.chicagomag.com'
BASE_URL = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
soup = BeautifulSoup(urlopen(BASE_URL).read())
sammies = soup.find_all("div", "sammy")
sammy_urls = []
for div in sammies:
    if div.a["href"].startswith("http"):
        sammy_urls.append(div.a["href"])
    else:
        sammy_urls.append(STEM_URL + div.a["href"])
    restaurant_names = [x for x in div.a.content]
I've tried div.a.br.content, div.br, but can't seem to get it right.
If suggesting a RegEx way, I'd also really appreciate a non-RegEx way if possible.
Locate the b element for every listing using a CSS selector and find the next text sibling:
for b in soup.select("div.sammy > div.sammyListing > a > b"):
    print b.find_next_sibling(text=True).strip()
Demo:
In [1]: from urllib2 import urlopen
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(urlopen('http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'))
In [4]: for b in soup.select("div.sammy > div.sammyListing > a > b"):
   ...:     print b.find_next_sibling(text=True).strip()
...:
Old Oak Tap
Au Cheval
...
The Goddess and Grocer
Zenwich
Toni Patisserie
Phoebe’s Bakery
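To tie this back to the question's restaurant_names: a sketch folding the names into the same loop that builds sammy_urls, on the same assumption that the text node after each <b> holds the name:
restaurant_names = []
for div in sammies:
    b = div.find("b")
    restaurant_names.append(b.find_next_sibling(text=True).strip())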
