Web parsing using Selenium and classes - Python

I am trying to parse several items from a blog, but I am unable to reach the last two items I need.
The html is:
<div class="post">
<div class="postHeader">
<h2 class="postTitle"><span></span>cuba and the cameraman</h2>
<span class="postMonth" title="2017">Nov</span>
<span class="postDay" title="2017">24</span>
<div class="postSubTitle"><span class="postCategories">TV Shows</span></div>
</div>
<div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a> <br />
n/A<br />
<br />
<strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br />
</p>
The data I need is the "cuba and the cameraman" title (code below), the "https://image.com/test.jpg" url, and the "http://www.imdb.com/title/tt7320560/" IMDB link.
So far I have only managed to correctly parse all the postTitle entries for the website:
all_titles = []
url = 'http://test.com'
browser.get(url)
titles = browser.find_elements_by_class_name('postHeader')
for title in titles:
    link = title.find_element_by_tag_name('a')
    all_titles.append(link.text)
But I can't get the image and IMDB links using the same class-name method as above. Could you support me on this? Thanks.

You need a more precise search; there is a whole family of find_element_by_XX functions built in. Try XPath:
for post in driver.find_elements_by_xpath('//div[@class="post"]'):
    title = post.find_element_by_xpath('.//h2[@class="postTitle"]//a').text
    img_src = post.find_element_by_xpath('.//div[@class="postContent"]//img').get_attribute('src')
    link = post.find_element_by_xpath('.//div[@class="postContent"]//a[last()]').get_attribute('href')
Remember you can always get the html source with driver.page_source and parse it using whatever tool you like.
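For example, a minimal sketch of that route with BeautifulSoup, assuming driver is the webdriver instance (called browser in your question) and the markup matches your snippet:
from bs4 import BeautifulSoup

# parse the page Selenium has already rendered
soup = BeautifulSoup(driver.page_source, 'html.parser')
for post in soup.select('div.post'):
    title = post.select_one('h2.postTitle a').get_text(strip=True)
    img_src = post.select_one('div.postContent img')['src']
    imdb_link = post.select('div.postContent a')[-1]['href']
    print(title, img_src, imdb_link)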

Related

Get hrefs from <a> Tags Located in Divs with Specific Classes Using BeautifulSoup

I need to get hrefs from <a> tags in a website, but not all of them, only the ones that are in the spans located in the <div>s with class arm:
<html>
<body>
<div class="arm">
<span>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
</span>
</div>
<div class="footnote">
<span>
<a href="https://example.org">anotherLink</a>
<a href="https://example.org">anotherLink</a>
<a href="https://example.org">anotherLink</a>
</span>
</div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs

request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    print("anchor['href']")
But my code doesn't print anything
Your code looks fine until you get to the print("anchor['href']") line, which I assume is meant to be print(anchor['href']).
Now, anchor is a ResultSet, which means you will need another loop to get the hrefs. Here is how those final lines should look if you want minimal modification to your code:
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    for x in anchor:
        print(x.attrs['href'])
We basically add:
for x in anchor:
    print(x.attrs['href'])
And you should get the hrefs. All the best.
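If you prefer, the nested select() calls can also be collapsed into a single CSS selector; a small sketch of that variant, reusing the html variable from the question:
# one selector walks the whole path: div.arm -> span -> direct child a
for a in html.select(".arm span > a"):
    print(a['href'])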
Try using the find_all() method to obtain the values in specific tags and classes.
I have replicated your HTML file and obtained the values in the span tags. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")

# Adding the input data to the HTML file
file_html.write('''<html>
<body>
<div class="arm">
<span>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
<a href="https://example.com">link</a>
</span>
</div>
<div class="footnote">
<span>
<a href="https://example.org">anotherLink</a>
<a href="https://example.org">anotherLink</a>
<a href="https://example.org">anotherLink</a>
</span>
</div>
</body>
</html>''')

# Saving the data into the HTML file
file_html.close()
Code:
from bs4 import BeautifulSoup as bs

# Reading the replicated html file
demo = open("demo.html", "r")
results = bs(demo, 'html.parser')

# Using the find_all method to find specific tags and classes
job_elements = results.find_all("div", class_="arm")
for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
Output:
https://example.com
https://example.com
https://example.com
https://example.com
https://example.com
https://example.com
https://example.com
https://example.com
https://example.com
Reference:
https://realpython.com/beautiful-soup-web-scraper-python/

Using BeautifulSoup to collect urls from html code

I have collected a list of links from a folder of documents that are essentially Wikipedia pages. I eventually realized that my list of links is incomplete, because my code only collects a few of the links from each Wikipedia page. My goal is to get all the links and then filter them afterwards. I should end up with a list of links to train-related accidents. Keywords for such accidents in the links vary between disaster, tragedy, etc.; I don't know them beforehand.
My input is
list_of_urls = []
for file in files:
    text = open('files_overview/'+file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for item in soup.findAll("div", attrs={'class':'mw-content-ltr'}):
        url = item.find('a', attrs={'class':'href'=="accident"}):
        # If I don't add something, like "accident", it gives me a syntax error..
        urls = url.get("href")
        urls1 = "https://en.wikipedia.org"+urls
        list_of_urls.append(urls1)
HTML code from one of my documents, in which multiple links appear, is given below:
</div><div class="mw-category-generated" lang="en" dir="ltr"><div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of 3 total. This list may not reflect recent changes (learn more).
</p><div lang="en" dir="ltr" class="mw-content-ltr"><h3>A</h3>
<ul><li><a href="/wiki/Atherstone_rail_accident">Atherstone rail accident</a></li></ul><h3>B</h3>
<ul><li><a href="/wiki/Bull_bridge_accident">Bull bridge accident</a></li></ul><h3>H</h3>
<ul><li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident">Helmshore rail accident</a></span></li></ul></div>
</div></div><noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968</a>"</div></div>
<div id="catlinks" class="catlinks" data-mw="interface"><div id="mw-normal-catlinks"
From the above, I manage to get Atherstone_rail_accident, but not Bull_bridge nor Helmshore.
Does anyone have a better approach?
Thank you for your time.
What happens?
You only grab a single element from each result of soup.findAll("div", attrs={'class':'mw-content-ltr'}), because find() returns just the first match; that's why you only get the first link.
Example
list_of_urls = []

for file in files:
    text = open('files_overview/'+file, encoding="utf-8").read()
    soup = BeautifulSoup(text, features="lxml")
    for a in soup.select('div.mw-content-ltr a'):
        list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')
How to fix?
Instead of selecting the <div>, select all the links inside your <div> and iterate over them:
for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(f'https://en.wikipedia.org{a["href"]}')
Output
['https://en.wikipedia.org/wiki/Atherstone_rail_accident',
'https://en.wikipedia.org/wiki/Bull_bridge_accident',
'https://en.wikipedia.org/wiki/Helmshore_rail_accident']
EDIT
If you would rather add the prefix https://en.wikipedia.org later in the process, just skip that step while appending the href to your list:
for a in soup.select('div.mw-content-ltr a'):
    list_of_urls.append(a["href"])
If you like to request the urls in a second step, you can do it like this:
for url in list_of_urls:
    response = requests.get(f'https://en.wikipedia.org{url}')
Or, if you just need a list with full urls, you can build it with a list comprehension:
list_of_urls = [f'https://en.wikipedia.org{url}' for url in list_of_urls]
You can do it like this:
First, find all the <div>s with the class name mw-content-ltr using .find_all().
For each <div> obtained above, find all the <a> tags using .find_all(). This will give you a list of <a> tags for each <div>.
Iterate over that list and extract the href from each <a> tag.
Here is the code:
from bs4 import BeautifulSoup

s = """
<div class="mw-category-generated" lang="en" dir="ltr">
<div id="mw-pages">
<h2><span class="anchor" id="Pages_in_category"></span>Pages in category "Railway accidents in 1860"</h2>
<p>The following 3 pages are in this category, out of 3 total. This list may not reflect recent changes (learn more).</p>
<div lang="en" dir="ltr" class="mw-content-ltr">
<h3>A</h3>
<ul>
<li><a href="/wiki/Atherstone_rail_accident">Atherstone rail accident</a></li>
</ul>
<h3>B</h3>
<ul>
<li><a href="/wiki/Bull_bridge_accident">Bull bridge accident</a></li>
</ul>
<h3>H</h3>
<ul>
<li><span class="redirect-in-category"><a href="/wiki/Helmshore_rail_accident">Helmshore rail accident</a></span></li>
</ul>
</div>
</div>
</div>
<noscript><img src="//en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1" alt="" title="" width="1" height="1" style="border: none; position: absolute;" /></noscript>
<div class="printfooter">Retrieved from "<a dir="ltr" href="https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968">https://en.wikipedia.org/w/index.php?title=Category:Railway_accidents_in_1860&oldid=895698968</a>"</div>
</div>
<div id="catlinks" class="catlinks" data-mw="interface">
"""

soup = BeautifulSoup(s, 'lxml')
divs = soup.find_all('div', class_='mw-content-ltr')
for div in divs:
    for a in div.find_all('a'):
        print(a['href'])
/wiki/Atherstone_rail_accident
/wiki/Bull_bridge_accident
/wiki/Helmshore_rail_accident

Delete block in HTML based on text

I have an HTML snippet below and I need to delete a block based on its text, for example Name: John. I know I can do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose because the blocks have different attributes and tag names, while the text within follows the same pattern. Is there a module in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text, see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
import re

from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(text=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
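Running this on the snippet above should print just the empty container:
<div id="container">
</div>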
Side note: notice that I fixed the closing tag for your "container" div!

Python Beautiful soup to scrape urls from a web page

I am trying to scrape urls from a website in html format. I use Beautiful Soup. Here's a part of the html.
<li style="display: block;">
<article itemscope itemtype="http://schema.org/Article">
<div class="col-md-3 col-sm-3 col-xs-12" >
<a href="/stroke?p=3083" class="article-image">
<img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health">
</a>
</div>
<div class="col-md-9 col-sm-9 col-xs-12">
<div class="article-content">
<a href="/stroke">
<img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;">
</a>
<a href="/stroke?p=3083" class="article-title">
<div>
<h4 itemprop="name" id="playground">
Banana Good for health </h4>
</div>
</a>
<div>
<div class="clear"></div>
<span itemprop="dateCreated" style="font-size:10pt;color:#777;">
<i class="fa fa-clock-o" aria-hidden="true"></i>
09/10 </span>
</div>
<p itemprop="description" class="hidden-phone">
<a href="/stroke?p=3083">
I love Banana.
</a>
</p>
</div>
</div>
</article>
</li>
My code:
import requests
from bs4 import BeautifulSoup

res = requests.get('http://xxxxxx')
bs = BeautifulSoup(res.text, "html.parser")
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])
The result will print out all the urls from this page, but this is not what I am looking for; I only want a particular one, like "/stroke?p=3083" in this example. How can I set that condition in Python? (I know there are three occurrences of "/stroke?p=3083" in this snippet, but I just need one.)
Another question: this url is not complete, so I need to combine it with "http://www.abcde.com" to get "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)
Just put a link in the scraper, replacing some_link, and give it a go. I suppose you will have your desired link along with its full form.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

res = requests.get(some_link).text
soup = BeautifulSoup(res, "lxml")
for item in soup.select(".article-image"):
    print(urljoin(some_link, item['href']))
As for your second question, combining the relative url with the base:
link = 'http://www.abcde.com' + link
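Plain concatenation works, but it silently builds broken urls when the slashes don't line up; urllib.parse.urljoin (already used in the answer above) handles that for you:
from urllib.parse import urljoin

base = 'http://www.abcde.com'
# both calls resolve to http://www.abcde.com/stroke?p=3083
print(urljoin(base, '/stroke?p=3083'))
print(urljoin(base + '/', 'stroke?p=3083'))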
You are getting most of it right already. Collect the hrefs as follows (just a list comprehension version of what you are doing already):
urls = [a['href'] for a in bs.find_all('a') if a.has_attr('href')]
This will give you the urls. To get one of them and append it to the abcde url, you could simply do the following:
if urls:
    new_url = 'http://www.abcde.com{}'.format(urls[0])

Parsing HTML File BeautifulSoup

I'm trying to parse a local html file with BeautifulSoup but having trouble navigating the tree.
The file is in the following format:
<div class="contents">
<h1>
USERNAME
</h1>
<div>
<div class="thread">
N1, N2
<div class="message">
<div class="message_header">
<span class="user">
USERNAME
</span>
<span class="meta">
Thursday, 1 January 2015 at 19:52 UTC
</span>
</div>
</div>
<p>
They're just friends
</p>
<div class="message">
<div class="message_header">
<span class="user">
USERNAME
</span>
<span class="meta">
Thursday, 1 January 2015 at 19:52 UTC
</span>
</div>
</div>
<p>
MESSAGE
</p>
...
I want to extract, for each thread:
for each div class='message':
the span class='user' and meta data
the message in the p directly after
This is a long file with many of these threads and many messages within each thread.
So far I've just opened the file and turned it into a soup:
from bs4 import BeautifulSoup

raw_data = open('file.html', 'r')
soup = BeautifulSoup(raw_data, 'html.parser')
contents = soup.find('div', {'class': 'contents'})
I'm looking at storing this data in a dictionary in the format:
dict[USERNAME] = [(MESSAGE1, time1), (MESSAGE2, time2)]
The username and meta info are relatively easy to grab, as they are nicely contained within their own span tags with a class identifier. The message itself is hanging around in loose paragraph tags; this is the trickier beast...
If you have a look at the "Going Sideways" section HERE it says "You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree".
with this in mind, you can extract the parts you want with this:
from bs4 import BeautifulSoup

your_html = "your html data"
souped_data = BeautifulSoup(your_html, 'html.parser')

for message in souped_data.find_all("div", {"class": "message"}):
    username = message.find('span', attrs={'class': 'user'}).get_text()
    meta = message.find('span', attrs={'class': 'meta'}).get_text()
    # .next_sibling alone can land on a whitespace text node,
    # so jump straight to the following <p> tag
    text = message.find_next_sibling('p').get_text()
First, find all the message tags. Within each, you can search for the user and meta class names. However, this just returns the tag itself; use .get_text() to get the data of the tag. Finally, use find_next_sibling('p') to get your message content, in the lonely old 'p' tags.
That gets you the data you need. As for the dictionary structure, hmmm... I would throw them all in a list of dictionary objects, then JSONify that badboy! However, maybe that's not what you need?
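If you do want the exact dictionary shape from the question, a sketch along these lines would build it (the file name and key layout come from the question; variable names are illustrative):
import json
from collections import defaultdict
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('file.html', 'r'), 'html.parser')

# {USERNAME: [(MESSAGE1, time1), (MESSAGE2, time2), ...]}
conversations = defaultdict(list)
for message in soup.find_all('div', {'class': 'message'}):
    user = message.find('span', {'class': 'user'}).get_text(strip=True)
    meta = message.find('span', {'class': 'meta'}).get_text(strip=True)
    body = message.find_next_sibling('p').get_text(strip=True)
    conversations[user].append((body, meta))

print(json.dumps(conversations, indent=2))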
