Scrape a password-protected website with no token - Python

(I'm sorry for my English, I'll try to do my best.)
I'm a newbie in Python and I'm looking for help with some web scraping. I already have working code to get the links I want, but the website is password protected.
With the help of a lot of questions I read, I managed to get code that scrapes the website after the login, but the links I want are on another page:
The login page is http://fantasy.trashtalk.co/login.php
The landing page (the one I scrape with this code) after login is http://fantasy.trashtalk.co/
And the page I want is http://fantasy.trashtalk.co/?tpl=classement&t=1
So I have this code (some imports are probably useless, they come from another script):
from bs4 import BeautifulSoup
import requests
from lxml import html
import urllib.request
import re
username = 'myusername'
password = 'mypass'
url = "http://fantasy.trashtalk.co/?tpl=classement&t=1"
log = "http://fantasy.trashtalk.co/login.php"
values = {'email': username,
          'password': password}
r = requests.post(log, data=values)
# Not sure about the code below but it works.
data = r.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
I understand that this code only lets me log in and then scrape what comes next (the landing page), but I can't figure out how to "save" my login info to access the page I want to scrape.
I think I should add something like this after the login code, but when I do, it only scrapes the links from the login page:
s = requests.get(url)
I also read some topics here about the "with session" thing, but I didn't manage to make it work.
Any help would be appreciated. Thank you for your time.

The issue is that you need to post your login credentials through a session object, not a one-off request, so the login cookie is reused on later requests. I've modified your code below, and you now have access to the HTML tags on the scrape_url page. Good luck!
import requests
from bs4 import BeautifulSoup
username = 'email'
password = 'password'
scrape_url = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'
login_url = 'http://fantasy.trashtalk.co/login.php'
login_info = {'email': username,'password': password}
#Start session.
session = requests.session()
#Login using your authentication information.
session.post(url=login_url, data=login_info)
#Request page you want to scrape.
url = session.get(url=scrape_url)
soup = BeautifulSoup(url.content, 'html.parser')
for link in soup.findAll('a'):
    print('\nLink href: ' + link['href'])
    print('Link text: ' + link.text)
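If you want to double-check that the session is actually authenticated before scraping, here is a minimal sanity check. It reuses session, login_url and login_info from the snippet above, and assumes a failed login simply returns a page that still contains the login form:
# Post the credentials and fail loudly on plain HTTP errors.
login_response = session.post(url=login_url, data=login_info)
login_response.raise_for_status()

# Assumption: the landing page only shows a password field when you are logged out.
landing = session.get('http://fantasy.trashtalk.co/')
if 'password' in landing.text.lower():
    print('Login probably failed - the landing page still shows the login form.')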

Related

How can I scrape the results of a search after logging into a website using Python and beautifulsoup4?

I want to log in to a website, perform a search on a page, and then scrape all the results.
I've managed to log in using Python and requests, but when I do a GET request on the page where I want to perform the search, or a POST request to that page with all the search criteria in the body, I don't get any search results. Instead, the title still says "Login to page", which is the title of the login page, so it seems I'm not able to perform any operation after logging in. Is there a specific way to scrape a website when it requires you to log in and then perform a search?
Following is my attempt:
import requests
from lxml import html
from bs4 import BeautifulSoup
USERNAME = "abcdefgh"
PASSWORD = "xxxxxxx"
LOGIN_URL = "https://www.assortis.com/en/login.asp"
URL = "https://www.assortis.com/en/members/bsc_search.asp?act=sc"
SEARCH_URL = "https://www.assortis.com/en/members/bsc_results.asp"
def scrapeIt():
    session_requests = requests.session()
    # login
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    # print(tree)
    # Create payload
    payload = {
        "login_name": USERNAME,
        "login_pwd": PASSWORD,
        "login_btn": "Login"
    }
    search_payload = {
        'mmb_cou_hid': '0,0',
        'mmb_don_hid': '0,0',
        'mmb_sct_hid': '0,0',
        'act': 'contract',
        'srch_sdate': '2016-01-01',
        'srch_edate': '2018-12-31',
        'procurement_type': 'Services',
        'srch_budgettype': 'any',
        'srch_budget': '',
        'srch_query': '',
        'srch_querytype': 'all of the words from'
    }
    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))
    # Scrape url
    result = session_requests.get(URL, headers=dict(referer=URL))
    result = session_requests.post(SEARCH_URL, data=search_payload, headers=dict(referer=SEARCH_URL))
    content = result.content
    # print(content)
    data = result.text
    soup = BeautifulSoup(data, 'html.parser')
    print(soup)

scrapeIt()
EDIT: the webpage is possibly rendered with JavaScript.
Save your response.text to a local file after you've logged in and check the file to see whether you actually logged in.
Otherwise, instead of reverse engineering the HTTP requests, try Selenium with chromedriver.
The login part is easier with Selenium, but finding things on the page is not: use explicit waits for dynamically loaded content and driver.page_source to see the HTML. Note that browsers sometimes write the HTML differently from the raw source, e.g. adding <tbody> tags.
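As a rough sketch of the first suggestion (the file name is just an example, and result is the final response from the snippet in the question):
# Write the post-login HTML to disk so you can open it in a browser or editor
# and check whether the login actually worked.
with open('after_login.html', 'w', encoding='utf-8') as f:
    f.write(result.text)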

Can't fetch links connected to different exhibitors from a webpage

I've been trying to fetch the links to different exhibitors from this webpage using a Python script, but I get nothing as a result and no error either. The class name m-exhibitors-list__items__item__name__link I've used within my script is available in the page source, so the links are not generated dynamically.
What change should I make within my script to get the links?
This is what I've tried with:
from bs4 import BeautifulSoup
import requests
link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))
One of the links I'm after (the first one):
https://www.topdrawer.co.uk/exhibitors/alessi-1
@Life is complex is right: the site you are trying to scrape is protected by the Incapsula service, which protects sites from web scraping and other attacks. It inspects the request headers to decide whether the request comes from a real browser or from a robot. Most likely the site has proprietary data, or they may be protecting against other threats.
However, there is an option to achieve what you want using Selenium and BS4.
The following code snippet is for your reference:
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
# Use a raw string so the backslashes in the Windows path are not treated as escape sequences.
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get(link)
html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")
results = soup.findAll("a", {"class": "m-exhibitors-list__items__item__name__link"})
# Iterate over the list of anchor tags to get the href attribute.
for item in results:
    print(item.get("href"))
wd.quit()
The site that you are attempting to scrape is protected with Incapsula.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

# http_headers was not defined in the original snippet; a browser-like User-Agent is assumed here.
http_headers = {'User-Agent': 'Mozilla/5.0'}
target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'
response = requests.get(target_url,
                        headers=http_headers, allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')
pprint(soupParser.text)
Output:
('Request unsuccessful. Incapsula incident ID: '
 '438002260604590346-1456586369751453219')
Read through this: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula
and these: https://stackoverflow.com/search?q=Incapsula

Trying to find the right variable for screen scraping

I have code written out and have tested the first bit (logging into the website), but I am trying to add a screen-scraping part to the code and am having trouble getting the result that I want. When I run the code I get "None" and I'm unsure what is causing this. I think it may be because I don't have the right attribute for what it is trying to scrape.
import requests
import urllib2
from bs4 import BeautifulSoup
with requests.session() as c:
    url = 'https://signin.acellus.com/SignIn/index.html'
    USERNAME = 'My user name'
    PASSWORD = 'my password'
    c.get(url)
    login_data = dict(Name=USERNAME, Psswrd=PASSWORD, next='/')
    c.post(url, data=login_data, headers={"Referer": "https://www.acellus.com/"})
    page = c.get('https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326')
    quote_page = 'https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326'
    page = urllib2.urlopen(quote_page)
    soup = BeautifulSoup(page, 'html.parser')
    price_box = soup.find('div', attrs={'class':'Object7069'})
    price = price_box
    print price
This is a screenshot of the "inspect element" of the data I want to screen scrape
I don't think mixing requests and urllib2 to log in is a good idea. There is the mechanize module for Python 2.x, with which you can log in through forms and retrieve content. Here is how your code would look:
import mechanize
from bs4 import BeautifulSoup
# logging in...
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("https://signin.acellus.com/SignIn/index.html")
br.select_form(nr=0)
br['AcellusID'] = 'your username'
br['Password'] = 'your password'
br.submit()
# parsing required information..
quote_page = 'https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326'
page = br.open(quote_page).read()
soup = BeautifulSoup(page, 'html.parser')
price_box = soup.find('div', attrs={'class':'Object7069'})
price = price_box
print price
Reference link: http://www.pythonforbeginners.com/mechanize/browsing-in-python-with-mechanize/
P.S.: mechanize is only available for Python 2.x. If you wish to use Python 3.x, there are other options (see "Installing mechanize for python 3.4").
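If you do want Python 3, one option is MechanicalSoup, which offers a similar form-based workflow on top of requests and BeautifulSoup. A rough sketch, reusing the form index and the AcellusID/Password field names from the mechanize example above (they may need adjusting for the real form):
import mechanicalsoup
from bs4 import BeautifulSoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://signin.acellus.com/SignIn/index.html")
browser.select_form(nr=0)            # assume the login form is the first form on the page
browser["AcellusID"] = 'your username'
browser["Password"] = 'your password'
browser.submit_selected()

# Fetch the page behind the login using the same authenticated browser.
quote_page = 'https://admin252.acellus.com/StudentFunctions/progress.html?ClassID=326'
page = browser.open(quote_page)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find('div', attrs={'class': 'Object7069'}))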

Python website login and scrape page (delayed?)

I'm trying to scrape a webpage that's behind a login page.
I know how to login using Python's requests.session().
However, when I retrieve the webpage, it doesn't seem to be fully loaded.
The html I receive is different from the html shown when I login through a browser.
My code is this:
import requests
from bs4 import BeautifulSoup as bs

session = requests.session()
login_data = {'email': 'myemailaddress', 'password': 'mypassword'}
session.post(url_login, login_data)
r = session.get(url_homepage)
soup = bs(r.content, 'lxml')
print(soup.prettify())
I'm getting the impression that the site does some scripting or redirecting after the initial load of url_homepage.
I've already tried putting a time.sleep(10) between the post and the get, but that doesn't do the trick.
I'm guessing I need session.get() to wait a number of seconds before it does the actual GET, but session.get() doesn't allow that.
Does anybody know how to do this, or can give me suggestions on how to proceed please?
I'm using Python 3.6 but solutions for other versions are ok too.
For example (LinkedIn):
You have to download ChromeDriver (or another driver); see the Selenium documentation.
import time
from selenium import webdriver
from bs4 import BeautifulSoup
def main():
    username = 'my_login'
    password = 'my_pass'
    linkedin = 'https://www.linkedin.com/uas/login'
    # sign in
    browser = webdriver.Chrome()
    browser.get(linkedin)
    browser.find_element_by_name("session_key").send_keys(username)
    browser.find_element_by_name("session_password").send_keys(password)
    browser.find_element_by_name("signin").click()
    time.sleep(3)
    # scrape
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    print(soup)
    # log out
    browser.find_element_by_id("nav-settings__dropdown-trigger").click()
    browser.find_element_by_link_text("Sign out").click()
    browser.quit()

if __name__ == '__main__':
    main()
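If the fixed time.sleep(3) turns out to be flaky, an explicit wait is usually more reliable. As a sketch, you could wait for the navigation element that the logout step already uses before scraping:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the post-login navigation element to appear
# instead of sleeping for a fixed amount of time.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "nav-settings__dropdown-trigger"))
)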

Not able to parse the 'data' for login

I'm trying to log in to my college website using Python. I want the source code of the welcome page, i.e. my dashboard, but when I run this I'm getting the same source code as the login page. Is this because I'm not able to post my info to the login form? Here is the code:
import requests
from bs4 import BeautifulSoup
from lxml import html
import collections
url = 'http://erp.college_name.edu/'
opening = requests.get(url)
r = requests.session()
stuff = collections.OrderedDict()
stuff = {
    'tbUserName': 'my_username',
    'tbPassword': 'my_password',
}
opens = r.post(url=url, data=stuff)
soup = BeautifulSoup(opens.text, 'lxml')
print(soup)
Any help?
You're probably not logging in correctly. Ideally, the site will give you a non-200 status code, which you can check with opens.status_code. A successful request's status code should start with a 2 (such as 200). Note that some sites won't provide reasonable status codes even if your request isn't correct.
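As a quick illustration using your own variable and field names (tbUserName comes from your form data), you can check both things right after the post:
opens = r.post(url=url, data=stuff)
print(opens.status_code)   # anything starting with 2 means the request itself went through
# If the login form's username field is still in the response,
# the login itself most likely did not succeed.
print('tbUserName' in opens.text)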
UPDATE
So, after getting the tokens:
import collections

url = 'http://erp.name_of_college.edu/'
opening = requests.get(url)
tree = html.fromstring(opening.text)
token = list(set(tree.xpath("//input[@name='name_of_token']/@value")[0]))
r = requests.session()
datas = collections.OrderedDict()
datas = {
    'tbUserName': 'my_username',
    'tbPassword': 'my_password',
    'name_of_token': token,
}
opens = r.post(url=url, data=datas)
soup = BeautifulSoup(opens.text, 'lxml')
print(soup)
Problem is solved: you need to include the tokens in the POST data; they are generally hidden input fields in the form. If the problem persists, include more data from the form ;)
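As a general sketch of that idea (using the URL and field names from the question), you can collect every hidden input on the login page with BeautifulSoup and merge it into the payload, so whatever hidden tokens the form carries, such as __VIEWSTATE on ASP.NET pages, are posted back automatically:
import collections

import requests
from bs4 import BeautifulSoup

url = 'http://erp.name_of_college.edu/'
r = requests.session()

# Fetch the login page and collect every hidden input, so the form's
# anti-forgery tokens are sent back unchanged.
login_page = r.get(url)
soup = BeautifulSoup(login_page.text, 'lxml')
payload = collections.OrderedDict(
    (tag['name'], tag.get('value', ''))
    for tag in soup.find_all('input', type='hidden')
    if tag.get('name')
)
payload['tbUserName'] = 'my_username'
payload['tbPassword'] = 'my_password'

opens = r.post(url=url, data=payload)
print(BeautifulSoup(opens.text, 'lxml'))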
