Web scraping using BeautifulSoup - python

I am trying to scrape data from Yellow Pages, the website being this one.
I want the div with class="search-results listing-group".
I tried this
parent = soup.find('div',{'class':"search-results listing-group"})
But I am not getting any result.

This URL has anti-scraping protections in place, which resist programmatic HTML extraction. That's the main reason why you're not getting any output. You can see this by examining the raw data returned from a request:
from bs4 import BeautifulSoup
import requests
url = "https://www.yellowpages.com.au/find/boat-yacht-sales/melbourne-vic"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
print(soup)
Excerpt:
This page appears when online data protection services detect requests coming from your computer network which appear to be in violation of our website's terms of use.
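As a quick diagnostic (an aside, not part of the original answer), you can also check the HTTP status code and retry with a browser-like User-Agent header. The header value below is illustrative, and a dedicated protection service may well block the request anyway:
import requests
from bs4 import BeautifulSoup
url = "https://www.yellowpages.com.au/find/boat-yacht-sales/melbourne-vic"
# Illustrative browser-like User-Agent; simple checks sometimes accept it,
# but anti-bot services frequently block it regardless
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(url, headers=headers)
print(response.status_code)  # a 403 or 429 here is another sign of blocking
soup = BeautifulSoup(response.text, "html.parser")
print(soup.find('div', {'class': 'search-results listing-group'}))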

Are you using requests?
It appears the webpage does not allow automated scraping, at least not with plain requests and Beautiful Soup.
I tried scraping it for you and this is what I see in the content.
<p style="font-weight: bold;">Why did this happen?</p>
<p style="margin-top: 20px;">This page appears when online data protection services detect requests coming from your computer network which appear to be in violation of our website's terms of use.</p>
</div>, <div style="border-bottom: 1px #E7E7E7 solid;
margin-top: 20px;
margin-bottom: 20px;
height: 1px;
width: 100%;">
</div>, <div style="margin-left: auto;
margin-right: auto;
font-size: 20px;
max-width: 460px;
text-align: center;">
We value the quality of content provided to our customers, and to maintain this, we would like to ensure real humans are accessing our information.</div>, <div style="margin-left: auto;
margin-right: auto;
margin-top: 30px;
max-width: 305px;">
You might have to try other (legitimate) methods of scraping it.

It seems the page you are accessing doesn't allow static scraping; you need to use dynamic scraping with Selenium, like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.yellowpages.com.au/find/boat-yacht-sales/melbourne-vic"
# "{location}" is a placeholder for the directory containing chromedriver
driver = webdriver.Chrome(executable_path="{location}/chromedriver")
driver.get(url)
# XPath attribute tests use @class; the class value keeps its space
content_elements = driver.find_elements_by_xpath("//div[@class='search-results listing-group']")
content_html = content_elements[0].get_attribute("innerHTML")
soup = BeautifulSoup(content_html, "html.parser")
print(soup)
Since the class name contains a space, you need to use something like XPath or an id to fetch the data.
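As an aside (not from the original answer), BeautifulSoup's CSS selector interface also copes with multi-class attributes, since each class becomes its own .class segment, assuming the HTML actually reaches you:
# Each class is its own selector segment, so the space is no problem
results = soup.select('div.search-results.listing-group')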
For more info on advanced scraping, read this one:
https://medium.com/dualcores-studio/advanced-web-scraping-in-python-d19dfccba235

Related

Change canvas style size inside several browser windows using python

I have several browsers open and each has a webpage already open that has some div and canvas elements.
<div class="webgl-content">
<div id="game" class="game" style="padding: 0px; margin: 0px; border: 0px; background: rgb(25, 39, 54);">
<canvas id="#canvas" style="width: 100%; height: 100%; cursor: default;" width="1400" height="800"></canvas>
</div>
</div>
How can I change the height: 100% to a different value (say 77%) in all browser windows using Python?
I am thinking of something that can identify the open browsers, find the canvas element, and change the style height percentage to a new value.
I've been playing around with the pywinauto library, but not much progress so far: I get errors indicating that the Chrome application was not found.
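One possible sketch, assuming the pages can be opened in browsers that Selenium launches and drives (Selenium generally cannot attach to windows that are already open), with a placeholder URL:
from selenium import webdriver
# Hypothetical: one driver per browser window you control
driver = webdriver.Chrome()
driver.get("http://example.com/game-page")  # placeholder URL
# Overwrite the inline style height on the canvas via JavaScript
driver.execute_script("document.querySelector('canvas').style.height = '77%';")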

Trying to scrape image and I get empty output

I'm trying to scrape a Twitter account image. I tried multiple ways, and the output keeps giving me an empty list!
My Code:
import requests
from bs4 import BeautifulSoup
url = requests.get('https://twitter.com/jack/photo')
soup = BeautifulSoup(url.text, 'lxml')
image = soup.find_all('img')
print(image)
Output:
[]
That's part of my project. I tried lxml and finding by class, but I still get nothing; maybe I'm missing something, but I don't know what it is.
If anyone can help me with it, I would really appreciate it.
Thanks in advance
I can see some React being used in the page. If you open the page and inspect the elements, you will see that as soon as you click on the photo to enlarge it, a new div appears as if from thin air, which implies that it gets created by React.
In order to address this you will need to use Selenium to open the page in a virtual browser, let the JavaScript do its magic and then look for the img tag.
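A minimal sketch of that approach (assuming chromedriver is installed; the fixed pause is crude, and an explicit WebDriverWait would be more robust):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://twitter.com/jack/photo')
time.sleep(5)  # crude pause to let the JavaScript render the img tags
soup = BeautifulSoup(driver.page_source, 'lxml')
print([img.get('src') for img in soup.find_all('img')])
driver.quit()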
You're trying to scrape the path for JavaScript Twitter. If you examine the response of your page, you will see the following snippet.
<form action="https://mobile.twitter.com/i/nojs_router?path=%2Fjack%2Fphoto" method="POST" style="background-color: #fff; position: fixed; top: 0; left: 0; right: 0; bottom: 0; z-index: 9999;">
<div style="font-size: 18px; font-family: Helvetica,sans-serif; line-height: 24px; margin: 10%; width: 80%;">
<p>We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?</p>
<p style="margin: 20px 0;">
<button type="submit" style="background-color: #1da1f2; border-radius: 100px; border: none; box-shadow: none; color: #fff; cursor: pointer; font-size: 14px; font-weight: bold; line-height: 20px; padding: 6px 16px;">Yes</button>
</p>
</div>
</form>
I would recommend disabling javascript in your browser and then figuring out how to view the photos like that. Then you could mimic those requests using requests.
What worked for me was sending a request to the path:
https://mobile.twitter.com/jack
Then use the CSS selector class="avatar". There should be one child, an image tag; grab the src of that image tag, and that should be the link to your photo.
As requested, here is the python code I used:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://mobile.twitter.com/jack')
soup = BeautifulSoup(response.text, 'lxml')
avatars = soup.findAll("td", {"class": "avatar"})
print(avatars[0].findAll('img')[0].get('src'))
Note: Twitter changes their layout frequently, so this may not work for long.

Django Download third party video instead of opening in new/same tab

I am currently trying to make a user download a video on clicking a link, but every time the URL (href) is clicked, it opens in a new tab. The video URL is a third-party URL (Instagram CDN).
Here is the code snippet from my template.html for the page.
<a target="_parent" href="{{ story.url }}&dl=1" rel="noopener noreferrer" style="text-align: center;
padding: 20px 0px;color: rgb(38, 38, 38);
font-size: 16px;
font-weight: 500;
transition: all 0.2s ease 0s;" download>DOWNLOAD</a>
I intentionally added '&dl=1' to the URL hoping it might have some effect. This makes the image links downloadable, but not the video links. Removing or adding the 'download' attribute seems to have no impact.
What should I do to force the download instead of opening in a new tab?
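One common workaround (a sketch, not from this thread; the view name and URL parameter are illustrative) is to proxy the file through your own Django view and set a Content-Disposition header, since browsers ignore the download attribute on cross-origin links:
# views.py -- hypothetical proxy view
import requests
from django.http import StreamingHttpResponse

def download_video(request):
    video_url = request.GET.get('url')  # illustrative: pass the CDN URL along
    upstream = requests.get(video_url, stream=True)
    response = StreamingHttpResponse(
        upstream.iter_content(chunk_size=8192),
        content_type=upstream.headers.get('Content-Type', 'video/mp4'),
    )
    # This header is what forces a download instead of playback in a tab
    response['Content-Disposition'] = 'attachment; filename="video.mp4"'
    return response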

Nested hidden tags scraping in python

First things first: I'm very new to Python and web scraping.
I have a page that needs to be scraped. I looked at a lot of sources and wasn't able to figure out how to scrape nested hidden tags. The page requires a login; my code handles that successfully and grabs the visible data. However, when it comes to scraping the nested elements within a div tag, it doesn't find anything.
HTML (before onClick() event)
<div id="topMenu" style="width: 1920px; position: relative; top: 46px;" onclick="menu(event);" oncontextmenu="javascript:if(!event.ctrlKey){return RightClickPopUp(event);}">
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: hidden; display: none; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: hidden; display: none; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
</div>
After I click on the div (which consists of multiple buttons), the first span tag becomes visible, and then control jumps into its appropriate nested span tag. My problem is accessing the text in the innermost span.
HTML (After onClick() event)
<div id="topMenu" style="width: 1920px; position: relative; top: 46px;" onclick="menu(event);" oncontextmenu="javascript:if(!event.ctrlKey){return RightClickPopUp(event);}">
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: visible; display: inline; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
<span id="3" class="cSub" lcid="63" lccl="Item" style="visibility: visible; display: inline; top: 20px;">
<span id="1" menuname="Cancel" parentid="63" class="Menu01" showmenu="010">Cancel</span>
</span>
</div>
Python Code
import mechanize
import http.cookiejar as cookielib
from bs4 import BeautifulSoup as soup

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("LOGIN_URL")
br.select_form(nr=0)
br.form['USER'] = 'un'
br.form['PASSWORD'] = 'pwd'
br.submit()
check = br.response().read()
print(check)  # login success
my_url = br.open("URL_I_NEED_TO_SCRAPE").read()
page_soup = soup(my_url, "html.parser")
containers = page_soup.find("div",{"id":"topMenu"})
This code helps me get the div, but nothing inside it. Is there a way to get the spans that are currently hidden inside this div?
There are many ways to extract inner hidden elements, such as span tags and src and alt attributes.
containers = page_soup.find("div", {"id": "topMenu"})
top_span = containers.find_all('span', class_='cSub')
print(len(top_span))  # the number of spans is two
inner_span = top_span[0].find('span')
inner_span_text = inner_span.text
class_inside_inner_span = inner_span['class']
For more details on web scraping, see my post: https://github.com/rajat4665/web-scraping-with-python

Beautiful Soup Get a Certain Paragraph

I am trying to get a certain paragraph of text from a website, but my current methodology is not working.
I want the paragraph at the bottom. Thank you for your help, and I apologize for being a novice. I tried reading the docs but could not decipher much.
from bs4 import BeautifulSoup
import requests
url = "https://pwcs.edu/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
container = soup.find("div",attrs={'class': 'alertWrapper'})
paragraph = container.find("p")
When I print paragraph.getText() I get a bunch of blank space but no errors.
The HTML is:
<div id="page">
<div id="em-alerts" role="alert">
<div class="alertWrapper">
<div class="container">
<span class="icon dom-bg">
<em class="fa fa-bell">
<!---->
</em>
</span>
<span id="alert">ALERT</span>
<p>All PWCS will open two hours late on Thursday, February 8, due to icy road conditions in certain areas. SACC will open two hours late. Parents always have the option to keep children home if they have safety concerns.
</p>
<p></p>
</div>
</div>
</div>
First, get as close as you can to the paragraphs:
container = soup.find('div', attrs={'class':'container'})
Then look for all the <p> tags in the container and join them:
'\n'.join([x.text for x in container.find_all('p') if x.text != ""])
This will put all the paragraphs together, linked by a newline between each paragraph if they're not blank.
Output:
'All PWCS will open two hours late on Thursday, February 8, due to icy
road conditions in certain areas. SACC will open two hours late.
Parents always have the option to keep children home if they have
safety concerns.\n '
soup = BeautifulSoup(data, "lxml")
container = soup.find("div",attrs={'class': 'alertWrapper'})
paragraph = container.find("p")
In your code above, you get only the first "p" tag: container.find("p") returns just the first match, and that first tag is an empty one. You can check the page source of the website; the container actually has multiple "p" tags in it.
What you need to do is:
for p in container.find_all("p"):
    print(p.text)
Following is the HTML content of the alertWrapper class on the website:
<div class="alertWrapper">
<div class="container"><span class="icon dom-bg"><em class="fa fa-bell"><!-- --></em></span>
<!--First "p" tag which is empty-->
<p>               
</p>
<table align="center" cellpadding="2" cellspacing="2" class="" style="border: 3px solid rgb(0, 176, 240);">
<tbody>
<tr>
<td class=""
style="margin: 2px; padding: 2px; border-image-source: none; border-image-slice: initial; border-image-width: initial; border-image-outset: initial; border-image-repeat: initial; background-color: rgb(255, 255, 255);">
<ul>
<!--Second "p" tag which you want-->
<p style="text-align: left; margin-left: 120px;"><strong><span
style='font-size: medium; letter-spacing: normal; font-family: "Times New Roman"; color: rgb(0, 112, 192);'>The PWCS Parent Divisionwide surveys, sent on January 9, were unexpectedly delayed at the US Post Office distribution center. The deadline for the parent survey, both paper and online, has been extended to Friday, February 9, 2018. </span></strong>
</p>
</ul>
</td>
</tr>
</tbody>
</table>
</div>
</div>
If you right-click and check the page source, the text you want is not available. The HTML you've provided and the page source don't match.
<div class="alertWrapper">
<div class="container"><span class="icon dom-bg"><em class="fa fa-bell"><!----></em></span><p>
<table style="border: 3px solid rgb(0, 176, 240);" align="center" cellpadding="2" cellspacing="2" class="">
<tbody>
This is happening because the content you want is generated dynamically by JavaScript. You won't be able to scrape that using the requests module.
You'll have to use other tools, like Selenium.
As of now, there are multiple divs on that page with class "container", so you could use the find_all() method instead of find(). For example:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://pwcs.edu/")
soup = BeautifulSoup(r.text, "lxml")
n = 0
for container in soup.find_all("div", attrs={'class': 'container'}):
    n += 1
    print('==', n, '==')
    for paragraph in container.find_all("p"):
        print(paragraph)
Alternatively, you can use .next_sibling:
for span in soup.find_all("span", attrs={'id': 'alert'}):
    if span.next_sibling:
        print('ALERT', span.next_sibling)
