Extract a link from a webpage using Python

Extract a link from a webpage using Python - python

I have this problem: I want to extract the URL of each single project from this page, but I don't know how to do that. I tried to extract it through
projects = main_page.find_all_next('div', attrs={'class':'relative self-start'})
but I don't get the link. How can I go through it? Thank you in advance for helping me.

This website dynamically loaded content. So you need something that can run javascript. There is an easy example to access site with selenium.
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.kickstarter.com/discover/categories/music"
dr = webdriver.Chrome() # or PhantomJS,Firefox
try:
dr.get(url)
main_page = BeautifulSoup(dr.page_source,"lxml")
projects = main_page.find_all('div', {'class':'relative self-start'})
project_showed = main_page.find_all("div",class_="bg-white black relative border-grey-500 border")
print(len(projects))
except Exception as e:
raise e
finally:
dr.close()
But if you can not load data in time, you should use WebDriverWait or Implicit to wait for it loading finished. WebDriverWait and Implicit

the link generated by javascript, you can't get it with BeutifulSoup, use Regex to capture url in javascript variable
import requests
import re
html = requests.get('https://www.kickstarter.com/discover/categories/music').text
listURL = re.findall(r'"project":"([^"]+)', html)
for url in listURL:
print url

Related

selenium python beautifulsoup stuck on currentpage

I am trying to scrape a public facebook group using beautifulsoup, I am using the mobile site for the lack of javascript there. So this script supposed to get the link from the 'more' keyword and get the text from p tag there, but it just gets the text from the current page's p tag. Can someone point me the problem? I am new to python and everything in this code.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import requests
browser = webdriver.Firefox()
browser.get('https://mobile.facebook.com/groups/22012931789?refid=27')
for elem in browser.find_elements_by_link_text('More'):
page = requests.get(elem.get_attribute("href"))
soup=BeautifulSoup(page.content,'html.parser')
print(soup.find_all('p')[0].get_text())

It's always useful to see what your script is actually doing, a quick way of doing this is by printing your results at certain steps along the way.
For example, using your code:
for elem in browser.find_elements_by_link_text('More'):
print("elem's href attribute: {}".format(elem.get_attribute("href")))
You'll notice that the first one's blank. We should test for this before trying to get requests to fetch it:
for elem in browser.find_elements_by_link_text('More'):
if elem.get_attribute("href"):
print("Trying to get {}".format(elem.get_attribute("href")))
page = requests.get(elem.get_attribute("href"))
soup=BeautifulSoup(page.content,'html.parser')
print(soup.find_all('p')[0].get_text())
Note that an empty elem.get_attribute("href") returns an empty unicode string, u'' - but pythons considers an empty string to be false, which is why that if works.
Which works fine on my machine. Hope that helps!

Using Selenium on Amazon

I do not understand why selenium will not input my data into amazon search. I know it opens the chrome browser to amazon but it will not fill in the search bar. Any ideas whats wrong with my code
from lxml import html, etree
import csv,os,json
import requests
from time import sleep
from selenium import webdriver
textsearch = "Taco Bell Sauce"
browser = webdriver.Chrome('/home/path/Documents/Selenium/chromedriver')
browser.get("http://www.amazon.com/")
content = browser.page_source
doc = html.fromstring(content)
search = selenium.find_element_by_id("twotabsearchtextbox")
search.send_keys(textsearch)
search.selenium.find_element_by_id("nav-search-submit-text").click()
Any corrections on how i can make this work

This is simply because you should handle the WebDriver instance, that you've created - browser instead of selenium which is Python library that contains webdriver...
So just replace
search = selenium.find_element_by_id("twotabsearchtextbox")
with
search = browser.find_element_by_id("twotabsearchtextbox")
P.S. Also replace
search.selenium.find_element_by_id("nav-search-submit-text").click()
with
browser.find_element_by_id("nav-search-submit-text").click()
or
search.submit()

You need to make a couple of adjustments in your code as follows:
The webdriver instance gets assigned to browser so while using find_element you need to use the browser. The Search Box and the Search Button are within input tag so better to construct an xpath or a css_selector as follows :
from lxml import html, etree
import csv,os,json
import requests
from time import sleep
from selenium import webdriver
textsearch = "Taco Bell Sauce"
browser = webdriver.Chrome(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
browser.get("http://www.amazon.com/")
content = browser.page_source
doc = html.fromstring(content)
search = browser.find_element_by_xpath("//input[#id='twotabsearchtextbox']")
search.send_keys(textsearch)
search.find_element_by_xpath("//input[#class='nav-input']").click()

python parse evernote shared notebook

I am trying to get data from evernote 'shared notebook'.
For example, from this one: https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c
I tried to use Beautiful Soup:
url = 'https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c'
r = requests.get(url)
bs = BeautifulSoup(r.text, 'html.parser')
bs
The result doesn't contain any text information from the notebook, only some code.
I also seen an advice to use selenium and find elements by XPath.
For example I want to find the head of this note - 'Term 3 Week2'. In Google Chrome i found that it's XPath is '/html/body/div[1]/div[1]/b/span/u/b'.
So i tried this:
driver = webdriver.PhantomJS()
driver.get(url)
t = driver.find_element_by_xpath('/html/body/div[1]/div[1]/b/span/u/b')
But it also didn't work, the result was 'NoSuchElementException:... '.
I am a newbie in python and especially parsing, so I would be glad to receive any help.
I am using python 3.6.2 and jupiter-notebook.
Thanks in advance.

The easiest way to interface with Evernote is to use their official Python API.
After you've configured your API key and can generally connect, you can then download and reference Notes and Notebooks.
Evernote Notes use their own template language called ENML (EverNote Markup Language) which is a subset of HTML. You'll be able to use BeautifulSoup4 to parse the ENML and extract the elements you're looking for.
If you're trying to extract information against a local installation (instead of their web app) you may also be able to get what you need from the executable. See how to pass arguments to the local install to extract data. For this you're going to need to use the Python3 subprocess module.
HOWEVER
If you want to use selenium, this will get you started:
import selenium.webdriver.support.ui as ui
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# your example URL
URL = 'https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c'
# create the browser interface, and a generic "wait" that we can use
# to intelligently block while the driver looks for elements we expect.
# 10: maximum wait in seconds
# 0.5: polling interval in seconds
driver = Chrome()
wait = ui.WebDriverWait(driver, 10, 0.5)
driver.get(URL)
# Note contents are loaded in an iFrame element
find_iframe = By.CSS_SELECTOR, 'iframe.gwt-Frame'
find_html = By.TAG_NAME, 'html'
# .. so we have to wait for the iframe to exist, switch our driver context
# and then wait for that internal page to load.
wait.until(EC.frame_to_be_available_and_switch_to_it(find_iframe))
wait.until(EC.visibility_of_element_located(find_html))
# since ENML is "just" HTML we can select the top tag and get all the
# contents inside it.
doc = driver.find_element_by_tag_name('html')
print(doc.get_attribute('innerHTML')) # <-- this is what you want
# cleanup our browser instance
driver.quit()

PhantomJS (Selenium) Cannot Load PDFs from direct urls

I was trying to save some PDF from a link via PhantomJS (selenium). So, I refered to this code that turns webpages to pdfs. and it worked just fine when I ran the exactly same code.
So, I have this pdf I wanted to save from a direct url and I tried that script... it didn't work. It just saves a PDF with 1 white page. That's all...
My Code :
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
def execute(script, args):
driver.execute('executePhantomScript', {'script': script, 'args' : args })
driver = webdriver.PhantomJS('phantomjs')
# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
driver.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
try:
WebDriverWait(driver, 40).until(EC.presence_of_element_located((By.ID, 'plugin')))
except Exception as TimeoutException:
print("I waited for far too long and still couldn't fine the view.")
pass
# set page format
# inside the execution script, webpage is "this"
pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
execute(pageFormat, [])
# render current page
render = '''this.render("test2.pdf")'''
execute(render, [])
I'm not sure what's happening and why is it happening. Need some assistance.
EDIT: This is just the test PDF that I was trying to get via Selenium. There are some other PDFs which I need to get and that website is checking god-knows-what to decide whether it's a human or a bot. So, Selenium is the only way.
EDIT 2 : So, here's the website I was practicing on : http://services.ecourts.gov.in/ecourtindia/cases/case_no.php?state_cd=26&dist_cd=8&appFlag=web
Select "Cr Rev - Criminal Revision" from "Case Type" drop down and input any number in case number and year. Click on "Go".
This will show a little table, click on "view" and it should show a table on full page.
Scroll down to the "orders" table and you should see "Copy of order". That's the pdf I'm trying to get.I have tried requests as well and it did not work.

Currently, PhantomJS and Chrome headless doesn't support download a file. If you are OK with Chrome browser, please see my example below. It finds a elements, and then add an attribute download. Finally, it clicks on the link to download the file to default Downloads folder.
import time
driver = webdriver.Chrome()
driver.get('http://www.planetpublish.com/free-ebooks/93/heart-of-darkness/')
pdfLinks = driver.find_elements_by_css_selector(".entry-content ul > li > a")
for pdfLink in pdfLinks:
script = "arguments[0].setAttribute('download',arguments[1]);"
driver.execute_script(script, pdfLink, pdfLink.text)
time.sleep(1)
pdfLink.click()
time.sleep(3)
driver.quit()

If you're just looking at downloading PDFs which aren't protected behind some javascript or stuff (essentially straightforward stuff), I suggest using the requests library instead.
import requests
url ='http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf'
r = requests.get(url)
with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
f.write(r.content)
# If large file
with requests.get(url, stream=True) as r:
with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
for chunk in r.iter_content(chunk_size=1024):
if chunk:
f.write(chunk)

I recommend you look at the pdfkit library.
import pdfkit
pdfkit.from_url('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf', 'out.pdf')
It makes downloading pdfs very simple with python. You will also need to download this for the library to work.
You could also try the code from this link shown below
#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.support.ui import WebDriverWait
# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
browser.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
button = browser.find_element_by_name('button')
button.click()
# wait for the page to load
WebDriverWait(browser, timeout=10).until(
lambda x: x.find_element_by_id('someId_that_must_be_on_new_page'))
# store it to string variable
page_source = browser.page_source
print(page_source)
which you will need to edit to make work for your pdf.

LXML XPATH - Data returned from one site and not another

I'm just learning python and decided to play with some website scraping.
I created 1 that works, and a second, almost identical as far as I can tell, that doesn't work, and I can't figure out why.
from lxml import html
import requests
page = requests.get('https://thronesdb.com/set/Core')
tree = html.fromstring(page.content)
cards = [tree.xpath('//a[#class = "card-tip"]/text()'),tree.xpath('//td[#data-th = "Faction"]/text()'),
tree.xpath('//td[#data-th = "Cost"]/text()'),tree.xpath('//td[#data-th = "Type"]/text()'),
tree.xpath('//td[#data-th = "STR"]/text()'),tree.xpath('//td[#data-th = "Traits"]/text()'),
tree.xpath('//td[#data-th = "Set"]/text()'),tree.xpath('//a[#class = "card-tip"]/#data-code')]
print(cards)
That one does what I expect (I know it's not pretty). It grabs certain elements from a table on the site.
This one returns [[]]:
from lxml import html
import requests
page = requests.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
tree = html.fromstring(page.content)
offers = [tree.xpath('//a[#class = "offer_title"]/text()')]
print(offers)
What I expect it to do is give me a list that has the text from each offer_title element on the page.
The xpath I'm gunning at I grabbed from Firebug, which is:
/html/body/div[1]/div/div/div/section/div[2]/ul[1]/li[2]/div/h3/a
Here's the actual string from the site:
Angelcare Digital Video And Sound Monitor - $89.99 ($90.00 Off)
I have also read a few other questions, but they didn't answer how this could work the first way, but not the second. Can't post them because of the link restrictions on new accounts.
Titles:
Python - Unable to Retrieve Data From Webpage Table Using Beautiful
Soup or lxml xpath
Python lxml xpath no output
Trouble with scraping text from site using lxml / xpath()
Any help would be appreciated. I did some reading on the lxml website about xpath, but I may be missing something in the way I'm building a query.
Thanks!

The reason why the first code is working is that required data is initially present in DOM while on second page required data is generated dynamically by JavaScript, so you cannot scrape it because requests doesn't support handling dynamic content.
You can try to use, for example, Selenium + PhantomJS to get required data as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = webdriver.PhantomJS(executable_path='/path/to/phantomJS')
driver.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
xpath = '//a[#class = "offer_title"]'
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
offers = [link.get_attribute('textContent') for link in driver.find_elements_by_xpath(xpath)]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extract a link from a webpage using Python - python

Related

selenium python beautifulsoup stuck on currentpage

Using Selenium on Amazon

python parse evernote shared notebook

PhantomJS (Selenium) Cannot Load PDFs from direct urls

LXML XPATH - Data returned from one site and not another

Categories

Resources