I have my code written out and have tested the first bit (logging into the website), but I am trying to add a screen-scraping part to the code and am having trouble getting the result I want. When I run the code I get "None", and I'm unsure what is causing this. I think it may be because I don't have the right attribute for the element it is trying to scrape.
import requests
import urllib2
from bs4 import BeautifulSoup

with requests.session() as c:
    url = 'https://signin.acellus.com/SignIn/index.html'
    USERNAME = 'My user name'
    PASSWORD = 'my password'
    c.get(url)
    login_data = dict(Name=USERNAME, Psswrd=PASSWORD, next='/')
    c.post(url, data=login_data, headers={"Referer": "https://www.acellus.com/"})
    page = c.get('https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326')

quote_page = 'https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
price_box = soup.find('div', attrs={'class':'Object7069'})
price = price_box
print price
This is a screenshot of the "inspect element" view of the data I want to screen-scrape.
I don't think using requests and urllib2 together to log in is a good idea. For Python 2.x there is the mechanize module, with which you can log in through forms and retrieve content. Here is how your code would look:
import mechanize
from bs4 import BeautifulSoup
# logging in...
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("https://signin.acellus.com/SignIn/index.html")
br.select_form(nr=0)
br['AcellusID'] = 'your username'
br['Password'] = 'your password'
br.submit()
# parsing required information..
quote_page = 'https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326'
page = br.open(quote_page).read()
soup = BeautifulSoup(page, 'html.parser')
price_box = soup.find('div', attrs={'class':'Object7069'})
price = price_box
print price
Reference link: http://www.pythonforbeginners.com/mechanize/browsing-in-python-with-mechanize/
P.S.: mechanize is only available for Python 2.x. If you wish to use Python 3.x, there are other options (see "Installing mechanize for python 3.4").
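If you do stay with requests (which works on Python 3.x), note that the actual bug in your original snippet is that urllib2.urlopen() opens a fresh connection that knows nothing about the cookies your requests session collected during login. Keeping every request on the same session object may be all that is needed. A minimal sketch, reusing the form field names (Name, Psswrd) from your snippet, which should be verified against the real login form:

import requests
from bs4 import BeautifulSoup

login_url = 'https://signin.acellus.com/SignIn/index.html'
progress_url = 'https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326'

with requests.Session() as c:
    c.get(login_url)  # pick up any cookies the login page sets
    login_data = {'Name': 'my username', 'Psswrd': 'my password', 'next': '/'}
    c.post(login_url, data=login_data,
           headers={'Referer': 'https://www.acellus.com/'})
    page = c.get(progress_url)  # SAME session, so the login cookies are sent

soup = BeautifulSoup(page.text, 'html.parser')
price_box = soup.find('div', attrs={'class': 'Object7069'})
print(price_box.text if price_box else 'div.Object7069 not found')

If this still prints nothing, the element is likely rendered by JavaScript, and a browser-based approach such as Selenium becomes necessary.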
I've been trying to fetch the links to different exhibitors from this webpage using a Python script, but I get nothing as a result, and no error either. The class name m-exhibitors-list__items__item__name__link I've used within my script is present in the page source, so the links are not generated dynamically.
What change should I make to my script to get the links?
This is what I've tried:
from bs4 import BeautifulSoup
import requests

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))
One such link I'm after (the first one):
https://www.topdrawer.co.uk/exhibitors/alessi-1
@Life is complex is right: the site you are trying to scrape is protected by the Incapsula service, which guards sites against web scraping and other attacks by checking whether a request comes from a browser or from a bot. It is also likely the site has proprietary data, or that they are protecting against other threats.
However, there is an option to achieve what you want using Selenium and BS4.
The following code snippet is for your reference:
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

# Use a raw string so the backslashes in the Windows path are not treated as escapes.
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get(link)
html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")
results = soup.findAll("a", {"class": "m-exhibitors-list__items__item__name__link"})

# Iterate over the list of anchor tags to get the href attribute.
for item in results:
    print(item.get("href"))
wd.quit()
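As a side note, if you don't want a visible browser window on every run, Chrome can be started headless. A minimal sketch (on older Selenium versions the keyword is chrome_options rather than options):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
wd = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)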
The site that you are attempting to scrape is protected by Incapsula.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

# http_headers was not shown in the original; a minimal browser-like set is assumed here.
http_headers = {'User-Agent': 'Mozilla/5.0'}
target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'
response = requests.get(target_url,
                        headers=http_headers, allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')
pprint(soupParser.text)
**OUTPUTS**
('Request unsuccessful. Incapsula incident ID: '
 '438002260604590346-1456586369751453219')
Read through this: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula
and these: https://stackoverflow.com/search?q=Incapsula
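If you want to detect the block programmatically rather than eyeballing the output, the incident text shown above is easy to test for. A small sketch building on the snippet above:

# Quick check for the Incapsula block page seen in the output above.
if 'Incapsula incident ID' in raw_html:
    print('Blocked by Incapsula; a real browser (e.g. the Selenium answer above) is needed.')
else:
    pprint(soupParser.text)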
I have a problem combining two libraries in Python 3.6. I use the Selenium Firefox WebDriver to log into a website, but when I want BeautifulSoup or Requests to read that website, it reads the link differently (it reads the page as if I had not logged in). How can I tell Requests that I have already logged in?
Below is the code I have written so far:
from selenium import webdriver
import config
import requests
from bs4 import BeautifulSoup

# choose webdriver
browser = webdriver.Firefox(executable_path="C:\\Users\\myUser\\geckodriver.exe")
browser.get("https://www.mylink.com/")

# log in
timeout = 1
login = browser.find_element_by_name("sf-login")
login.send_keys(config.USERNAME)
password = browser.find_element_by_name("sf-password")
password.send_keys(config.PASSWORD)
button_log = browser.find_element_by_xpath("/html/body/div[2]/div[1]/div/section/div/div[2]/form/p[2]/input")
button_log.click()

name = "https://www.policytracker.com/auctions/page/"
browser.get(name)
name2 = "/html/body/div[2]/div[1]/div/section/div/div[2]/div[3]/div[" + str(N) + "]/a"  # N: link index (not shown here)

# next page loaded
title1 = browser.find_element_by_xpath(name2)
title1.click()

page = browser.current_url  # this saves the url of the page whose content I want (I've already logged in on that page)
r = requests.get(page)      # I want requests to go to this page; it goes, but without the logged-in state.... WRONG
soup = BeautifulSoup(r.content, 'lxml')
print(soup)
If you simply want to pass the page source to BeautifulSoup, you can get the page source from Selenium and pass it to BeautifulSoup directly (no need for the requests module).
Instead of
page = browser.current_url
r = requests.get(page)
soup = BeautifulSoup(r.content, 'lxml')
you can do
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
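If you really do need requests itself (for instance, to fetch many pages quickly once logged in), another option is to copy the cookies out of the Selenium browser into a requests.Session. This is a sketch and assumes the site tracks the login purely with cookies:

import requests
from bs4 import BeautifulSoup

s = requests.Session()
# Copy every cookie from the logged-in Selenium browser into the requests session.
for cookie in browser.get_cookies():
    s.cookies.set(cookie['name'], cookie['value'], domain=cookie.get('domain'))

# Requests made through s now carry the logged-in cookies.
r = s.get(browser.current_url)
soup = BeautifulSoup(r.content, 'lxml')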
(I'm sorry for my English, I'll try to do my best.)
I'm a newbie in Python and I'm seeking help with some web scraping. I already have working code to get the links I want, but the website is protected by a password.
With the help of a lot of questions I read, I managed to get working code that scrapes the website after the login, but the links I want are on another page:
the login page is http://fantasy.trashtalk.co/login.php
the landing page (the one I scrape with this code) after login is http://fantasy.trashtalk.co/
and the page I want is http://fantasy.trashtalk.co/?tpl=classement&t=1
So I have this code (some imports are probably useless; they come from another code):
from bs4 import BeautifulSoup
import requests
from lxml import html
import urllib.request
import re

username = 'myusername'
password = 'mypass'

url = "http://fantasy.trashtalk.co/?tpl=classement&t=1"
log = "http://fantasy.trashtalk.co/login.php"

values = {'email': username,
          'password': password}

r = requests.post(log, data=values)

# Not sure about the code below but it works.
data = r.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
I understand that this code only allows me to access the login page and then scrape what comes next (the landing page), but I can't figure out how to "save" my login info to access the page I want.
I think I should add something like this after the login code, but when I do, it only scrapes the links from the login page:
s = requests.get(url)
Also, I read some topics here about using the "with session" approach, but I didn't manage to make it work.
Any help would be appreciated. Thank you for your time.
The issue was that you needed to save your login credentials by posting them through a session object, not a plain request. I've modified your code below; you now have access to the HTML tags located on the scrape_url page. Good luck!
import requests
from bs4 import BeautifulSoup
username = 'email'
password = 'password'
scrape_url = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'
login_url = 'http://fantasy.trashtalk.co/login.php'
login_info = {'email': username,'password': password}
#Start session.
session = requests.session()
#Login using your authentication information.
session.post(url=login_url, data=login_info)
#Request page you want to scrape.
url = session.get(url=scrape_url)
soup = BeautifulSoup(url.content, 'html.parser')
for link in soup.findAll('a'):
    print('\nLink href: ' + link['href'])
    print('Link text: ' + link.text)
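One caveat: many login forms also include hidden fields (CSRF tokens and the like) that must be posted together with the credentials. This particular site apparently does not require them, but if a plain post ever stops working, a common pattern is to scrape the login page first and merge its hidden inputs into the payload. A hedged sketch reusing the names from the code above:

# Fetch the login form first and pick up any hidden inputs (CSRF tokens etc.).
login_page = session.get(login_url)
form_soup = BeautifulSoup(login_page.content, 'html.parser')
payload = {tag['name']: tag.get('value', '')
           for tag in form_soup.find_all('input', type='hidden')
           if tag.get('name')}
payload.update(login_info)  # credentials override/extend the hidden fields
session.post(url=login_url, data=payload)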
I'm trying to scrape a webpage that's behind a login page.
I know how to login using Python's requests.session().
However, when I retrieve the webpage, it does not seem to be fully loaded.
The HTML I receive is different from the HTML shown when I log in through a browser.
My code is this:
import requests
from bs4 import BeautifulSoup as bs

session = requests.session()
login_data = {'email': 'myemailaddress', 'password': 'mypassword'}
session.post(url_login, login_data)
r = session.get(url_homepage)
soup = bs(r.content, 'lxml')
print(soup.prettify())
I get the impression that the site does some scripting or redirecting after the initial load of url_homepage.
I've already tried putting a time.sleep(10) between the post and the get, but that doesn't do the trick.
I'm guessing I need session.get() to wait a number of seconds before it does the actual get, but session.get() doesn't allow that.
Does anybody know how to do this, or can you give me suggestions on how to proceed?
I'm using Python 3.6, but solutions for other versions are fine too.
requests can't execute the JavaScript that many sites run after the initial page load, but a real browser driven by Selenium can. For example (LinkedIn): you have to download ChromeDriver (or another driver); see the Selenium documentation.
import time
from selenium import webdriver
from bs4 import BeautifulSoup

def main():
    username = 'my_login'
    password = 'my_pass'
    linkedin = 'https://www.linkedin.com/uas/login'

    # sign in
    browser = webdriver.Chrome()
    browser.get(linkedin)
    browser.find_element_by_name("session_key").send_keys(username)
    browser.find_element_by_name("session_password").send_keys(password)
    browser.find_element_by_name("signin").click()
    time.sleep(3)

    # scrape
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    print(soup)

    # log out
    browser.find_element_by_id("nav-settings__dropdown-trigger").click()
    browser.find_element_by_link_text("Sign out").click()
    browser.quit()

if __name__ == '__main__':
    main()
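One refinement worth mentioning: the fixed time.sleep(3) can be replaced with an explicit wait, which returns as soon as the page is actually ready. A sketch that would drop into main() in place of the sleep (the element ID is the one already used in the log-out step):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait (up to 10 s) until the post-login navigation bar exists, instead of sleeping.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "nav-settings__dropdown-trigger"))
)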
I am trying to scrape the pricing information from these two websites: site1 and site2
I am using Python and packages BeautifulSoup and requests.
What I realized is that the pricing section is not present in the page source of either site. So I am wondering how I can scrape the data.
Any advice would be appreciated. Thank you
The problem is that first you need to select a country to see the prices.
In technical terms, you need to make a POST request to http://www.strem.com/catalog/index.php to select a country; then you can get the prices:
from bs4 import BeautifulSoup
import requests

URL = "http://www.strem.com/catalog/v/29-6720/17/copper_1300746-79-5"

session = requests.session()
p = session.post("http://www.strem.com/catalog/index.php", {'country': 'USA',
                                                            'page_function': 'select_country',
                                                            'item_id': '7211',
                                                            'group_id': '17'})
response = session.get(URL)
soup = BeautifulSoup(response.content)
print [td.text.strip() for td in soup.find_all('td', class_='price')]
This prints:
[u'US$85.00', u'US$285.00', u'US$1,282.00', u'US$3,333.00']
A more elegant solution would be to submit the form using the mechanize package:
import cookielib
from bs4 import BeautifulSoup
import mechanize
URL = "http://www.strem.com/catalog/v/29-6720/17/copper_1300746-79-5"
browser = mechanize.Browser()
cj = cookielib.LWPCookieJar()
browser.set_cookiejar(cj)
browser.open(URL)
browser.select_form(nr=1)
browser.form['country'] = ['USA']
browser.submit()
data = browser.response().read()
soup = BeautifulSoup(data)
print [td.text.strip() for td in soup.find_all('td', class_='price')]
Prints:
[u'US$85.00', u'US$285.00', u'US$1,282.00', u'US$3,333.00']
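Note that both snippets above are Python 2 (print statements, cookielib, u'' strings). On Python 3 the cookie-jar module is http.cookiejar, and the requests-based variant ports almost mechanically; a sketch of the same steps:

# Python 3 version of the requests-based approach above.
import requests
from bs4 import BeautifulSoup

URL = "http://www.strem.com/catalog/v/29-6720/17/copper_1300746-79-5"

session = requests.Session()
session.post("http://www.strem.com/catalog/index.php",
             data={'country': 'USA', 'page_function': 'select_country',
                   'item_id': '7211', 'group_id': '17'})
response = session.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')
print([td.text.strip() for td in soup.find_all('td', class_='price')])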