I need to get a screenshot of the page. The site https://kad.arbitr.ru/ blocks Selenium: after pressing the search button the site does nothing. In the inspector I see a POST request sent via XHR. How can I execute that POST request against this site? Sorry for the stupid question, I'm a newbie in Python. Maybe you can suggest another solution.
# driver.requests below is provided by the selenium-wire package (assumption based on the code)
from seleniumwire import webdriver
from selenium.webdriver.firefox.options import Options

fp = webdriver.FirefoxProfile()
options = Options()
#options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'FILES\\geckodriver.exe', firefox_profile=fp)
driver.get('https://kad.arbitr.ru/')
#pickle.dump(driver.get_cookies(), open("cookies.pkl", "wb"))  # Cookies
#response = webdriver.request('POST', 'https://kad.arbitr.ru/Kad/SearchInstances')
#print(response)

# print every captured request with its response status and content type
for request in driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.response.headers['Content-Type']
        )
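For the POST part of the question: one hedged approach, building on the commented-out idea above, is to let selenium-wire (which provides driver.requests) capture the XHR fired by the search button and then replay it with the requests library using the browser's cookies. The /Kad/SearchInstances endpoint comes from the commented line; everything else is an untested sketch.

import requests

# assumes the search button was already clicked in the Selenium session,
# so selenium-wire has captured the XHR POST to /Kad/SearchInstances
captured = next(
    r for r in driver.requests
    if r.method == 'POST' and 'Kad/SearchInstances' in r.url
)

session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

# replay the same POST with the same body and content type
replayed = session.post(
    captured.url,
    data=captured.body,
    headers={'Content-Type': captured.headers['Content-Type']},
)
print(replayed.status_code)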
You can take screenshots in a simple way using driver.save_screenshot('screenshot.png').
You can read more in the first result of a quick web search.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options  # the Options() used below
from PIL import Image
from Screenshot import Screenshot_Clipping  # from the Selenium-Screenshot package

fp = webdriver.FirefoxProfile()
options = Options()
driver = webdriver.Firefox(options=options, executable_path=r'FILES\\geckodriver.exe', firefox_profile=fp)
driver.get('https://kad.arbitr.ru/')

driver.save_screenshot('ss.png')  # this is the answer
screenshot = Image.open('ss.png')
screenshot.show()

# another method: full-page screenshot via the Selenium-Screenshot package
ss = Screenshot_Clipping.Screenshot()
image = ss.full_Screenshot(driver, save_path=r'.', image_name='name.png')
You can take a screenshot with selenium pretty easily:
driver.save_screenshot("image.png")
This question is so popular that there are dozens of APIs dedicated to taking screenshots of a webpage. You can find them with a quick search.
Related
I am trying to extract data from https://www.realestate.com.au/
First I create my URL based on the type of property I am looking for, and then I open the URL with the Selenium webdriver, but the page is blank!
Any idea why this happens? Is it because this website doesn't give permission for web scraping? Is there any way to scrape this website?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)
You are passing the link incorrectly; try it like this:
driver.get("your link")
API reference: https://selenium-python.readthedocs.io/api.html?highlight=get#:~:text=ef_driver.get(%22http%3A//www.google.co.in/%22)
I did try to access realestate.com.au through Selenium, and in a different use case through Scrapy.
I even got results from Scrapy crawling by using a proper user agent and cookies, but after a few days realestate.com.au detects Selenium / Scrapy and blocks the requests.
Additionally, it is clearly written in their terms & conditions that indexing any content on their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
Bottom line is, you have to get past their security if you want to scrape the content; a hedged user-agent example is sketched below.
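For what it's worth, here is a minimal (untested) sketch of the user-agent part of that approach with Selenium + Chrome; the user-agent string is just an example, and this alone may not be enough to avoid detection:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# replace the default Selenium user agent with a regular browser string (example value)
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
)
driver = webdriver.Chrome(options=options)
driver.get("https://www.realestate.com.au/")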
Was hoping someone could help me understand what's going on:
I'm using Selenium with Firefox browser to download a pdf (need Selenium to login to the corresponding website):
le = browser.find_elements_by_xpath('//*[@title="Download PDF"]')
time.sleep(5)
if le:
    pdf_link = le[0].get_attribute("href")
    browser.get(pdf_link)
The code does download the PDF, but after that it just stays idle.
This seems to be related to the following browser settings:
fp.set_preference("pdfjs.disabled", True)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
If I disable the first, it doesn't hang but opens the PDF instead of downloading it. If I disable the second, a "Save As" pop-up window shows up. Could someone explain how to handle this?
For me, the best way to solve this was to let Firefox render the PDF in the browser via pdf.js and then send a subsequent fetch via the Python requests library with the selenium cookies attached. More explanation below:
There are several ways to render a PDF via Firefox + Selenium. If you're using the most recent version of Firefox, it'll most likely render the PDF via pdf.js so you can view it inline. This isn't ideal because now we can't download the file.
You can disable pdf.js via Selenium options but this will likely lead to the issue in this question where the browser gets stuck. This might be because of an unknown MIME-Type but I'm not totally sure. (There's another StackOverflow answer that says this is also due to Firefox versions.)
However, we can bypass this by passing Selenium's cookie session to requests.session().
Here's a toy example:
import requests
from selenium import webdriver
pdf_url = "/url/to/some/file.pdf"
# set up the driver with whatever options/profile you need
driver = webdriver.Firefox()
# do whatever you need to do to auth/login/click/etc.
# navigate to the PDF URL in case the PDF link issues a
# redirect because requests.session() does not persist cookies
driver.get(pdf_url)
# get the URL from Selenium
current_pdf_url = driver.current_url
# create a requests session
session = requests.session()
# add Selenium's cookies to requests
selenium_cookies = driver.get_cookies()
for cookie in selenium_cookies:
    session.cookies.set(cookie["name"], cookie["value"])
# Note: If headers are also important, you'll need to use
# something like seleniumwire to get the headers from Selenium
# Finally, re-send the request with requests.session
pdf_response = session.get(current_pdf_url)
# access the bytes response from the session
pdf_bytes = pdf_response.content
I highly recommend using seleniumwire over regular selenium because it extends Python Selenium to let you return headers, wait for requests to finish, use proxies, and much more.
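For example, a minimal (untested) sketch of inspecting headers with selenium-wire might look like this; the URL is just a placeholder:

from seleniumwire import webdriver  # pip install selenium-wire

driver = webdriver.Firefox()
driver.get("https://example.org/some-page")  # placeholder URL

# wait for a matching request to complete, then inspect it
request = driver.wait_for_request("some-page", timeout=10)
print(request.headers)             # headers the browser actually sent
print(request.response.headers)    # headers the server returned
driver.quit()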
I am trying to save videos from this url:
Original:
https://api2.musical.ly/aweme/v1/play/?video_id=v09044a20000beeff4c108gs7sflfdug
Link changes to this:
http://v16.muscdn.com/3d238aa3e1c34000ce53792155cd0e15/5bcf3070/video/tos/maliva/tos-maliva-v-0068/e5a1ab74d0b54f97b3578924a428e58d/
The video is from TikTok. When you go to the URL, it instantly redirects you to another URL, which is the one I want in order to save the video. However, the page it redirects to has no "view HTML source" option. I can inspect the element and see that it has a video tag, but I cannot find a way to extract the URL inside that tag. I am using Python and BeautifulSoup. I tried to do this with Selenium, but to no effect.
Edit:
The link that it redirects to changes all the time! As of 27/08/2019, the link below works...
If you get Access denied you should check the link once again...
I think you should use other libraries for saving videos...
For example (in Python 3+):
import urllib.request
vid_url = "http://v19.muscdn.com/21b98c731608b8aa296ec31468c26dd1/5d652a88/video/tos/maliva/tos-maliva-v-0068/e5a1ab74d0b54f97b3578924a428e58d/?rc=amdvdnY7NDdpaDMzNTczM0ApdSlINzU2NTM0MzM2MzM1MzQ1b2k5ZmU5Z2c1ZGY5ZmQzPGZAaUBoNnYpQGczdilAZjY1QHJjYzRkLWBjYl8tLV4xNnNzOmk0NTU1LjQtLi4uMTQ0NTYtOiM2MDAtXjQzXzMxMTFeMWEzYSNvIzphLW8jOmAtbyMwLl4%3D"
urllib.request.urlretrieve(vid_url, "your_video_name.mp4")
If you insist on using selenium you can add options like this:
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
"download.default_directory": r"C:\Users\xxx\downloads\Test",
"download.prompt_for_download": False,
"download.directory_upgrade": True,
"safebrowsing.enabled": True
})
driver = webdriver.Chrome(chrome_options=options)
Hope this helps you!
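If you do stick with Selenium, one hedged way to get the URL inside the video tag (the part the question could not extract) is to read its src attribute and feed it to urllib as in the snippet above. The assumption that the redirected page exposes a <video> element with a usable src is untested:

import urllib.request
from selenium import webdriver

# reuse the ChromeOptions built above
driver = webdriver.Chrome(chrome_options=options)
driver.get("https://api2.musical.ly/aweme/v1/play/?video_id=v09044a20000beeff4c108gs7sflfdug")

# assumption: the redirected page exposes the clip through a <video> element
video = driver.find_element_by_tag_name("video")
vid_url = video.get_attribute("src")
urllib.request.urlretrieve(vid_url, "your_video_name.mp4")
driver.quit()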
The funny thing is that I'm not getting any errors running this code, but I do believe the script isn't using a proxy when it reloads the page. Here's the script:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chop = webdriver.ChromeOptions()
proxy_list = input("Name of proxy list file?: ")
proxy_file = open(proxy_list, 'r')
print('Enter url')
url = input()
driver = webdriver.Chrome(chrome_options=chop)
driver.get(url)
import time

for x in range(0, 10):
    import urllib.request
    import time
    proxies = []
    for line in proxy_file:
        proxies.append(line)
    proxies = [w.replace('\n', '') for w in proxies]
    while True:
        for i in range(len(proxies)):
            proxy = proxies[i]
            proxy2 = {"http": "http://%s" % proxy}
            proxy_support = urllib.request.ProxyHandler(proxy2)
            opener = urllib.request.build_opener(proxy_support)
            urllib.request.install_opener(opener)
            urllib.request.urlopen(url).read()
            time.sleep(5)
            driver.get(url)
            time.sleep(5)
Just wondering how I can use a proxy list with this script and have it work properly.
As far as I know, it's almost impossible to make Selenium work with a proxy on Firefox & Chrome. I tried for a day and got nothing. I didn't try Opera, but it will probably be the same.
I also saw a freelancer request along the lines of: "I have a problem with proxies on Selenium + Chrome. I'm a developer and lost 2 days trying to make it work, with nothing to show for it. If you don't know for sure how to make it work properly, please don't disturb me -- it's harder than just copy-pasting from the internet."
But it's easy to do with PhantomJS -- this feature works well there.
So try to do the needed actions via PhantomJS; a sketch of how that might look is below.
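For reference, a minimal sketch of passing a proxy to PhantomJS through Selenium (untested here; note that PhantomJS is no longer maintained). The host and port are placeholders:

from selenium import webdriver

# placeholder proxy address; PhantomJS takes proxy settings via service_args
service_args = [
    '--proxy=127.0.0.1:8080',
    '--proxy-type=http',
]
driver = webdriver.PhantomJS(service_args=service_args)
driver.get('http://example.org')
print(driver.page_source[:200])
driver.quit()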
I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
from urllib import request  # assuming `request` here refers to urllib.request

url = eveCentralBaseURL + str(mineral)  # eveCentralBaseURL and mineral are defined elsewhere
print("URL : %s" % url)
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm trying to find a particular value, I'm finding a template instead, e.g. "{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html)
print("Finding...")
for tag in soup.find_all('formatPrice median'):
    print(tag.text)
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using JavaScript templates (for instance something like Handlebars), then this is what you will get with any of the standard solutions (i.e. BeautifulSoup or requests).
This is because the browser uses JavaScript to alter what it received and create new DOM elements. urllib does the requesting part like a browser, but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
parse the ajax JSON directly (a sketch of this is given after the list below)
use an offline JavaScript interpreter to process the request (SpiderMonkey, crowbar)
use a browser automation tool (splinter)
This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and it's pretty handy.
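For option 1, the idea is to find the XHR call the page makes (visible in the browser's network tab) and request that endpoint directly. The endpoint below is purely a placeholder, not the real eve-central API:

import requests

# placeholder endpoint: substitute the XHR URL you see in the network tab
ajax_url = "http://example.org/api/quicklook?typeid=34"

data = requests.get(ajax_url).json()
# with the raw JSON in hand, the value the template would have rendered
# (e.g. the median price) can be read straight from the data structure
print(data)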
EDIT
From your comments it looks like it is a handlebars driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)
I used Selenium + headless Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# completing the snippet: create the driver and load the page
driver = webdriver.Chrome(options=options)
driver.get("http://" + url)  # driver.get needs a scheme on the URL
html = driver.page_source
Building off another answer: I had a similar issue. wget and curl no longer work well for grabbing the content of a web page; they are particularly broken with dynamic and lazily loaded content. Using Chrome (or Firefox, or the Chromium version of Edge) allows you to deal with redirects and scripting.
Below will launch an instance of Chrome, increase the timeout to 5 sec, and navigate this browser instance to a url. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Example of grabbing dynamic content and taking screenshots of anchor elements (the "a" tag, i.e. hyperlinks):
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of the object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2-x1, y2-y1])
locations.sort(key = lambda x: -x[-2] - x[-1])
locations = [ (el,x1,y1,x2,y2, width,height)
for el,x1,y1,x2,y2,width,height in locations
if not (
## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
x2 <= x1 or y2 <= y1 or x2<0 or y2<0
## Further restrict if you expect the objects to be around a specific size
## or width<200 or height<100
)
]
for el, x1, y1, x2, y2, width, height in tqdm(locations[:10]):
    try:
        print('-'*100, f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)
    except Exception as err:
        print(err)
Installation for mac+chrome:
pip install selenium
brew cask install chromedriver
brew cask install google-chrome
I used a Mac for the original answer, and Ubuntu + a Windows 11 preview via WSL2 after updating; Chrome ran from the Linux side with an X service on Windows rendering the UI.
Regarding responsibility, please respect robots.txt on each site.
I know this is an old question, but sometimes there is a better solution than using heavyweight Selenium.
This requests module for Python comes with JS support (in the background it is still Chromium), and you can still use BeautifulSoup as normal.
Though sometimes, if you have to click elements or something similar, I guess Selenium is the only option.
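The module being referred to is most likely requests-html (an assumption, since the answer does not name it); a minimal sketch of rendering JavaScript with it and then parsing as usual might look like this:

from bs4 import BeautifulSoup
from requests_html import HTMLSession  # pip install requests-html

session = HTMLSession()
r = session.get("http://example.org")  # placeholder URL
r.html.render()  # executes the page's JavaScript via a bundled Chromium

# after rendering, the HTML can be parsed as usual with BeautifulSoup
soup = BeautifulSoup(r.html.html, "html.parser")
print(soup.title)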